Two guiding principles for improving Gov.uk public datasets

I've worked with public data from gov.uk a great deal over the years, and I can firmly say that Britain is setting a high standard for adopting open data principles and bringing more transparency and accountability to public administration. The data.gov.uk portal is a particularly encouraging initiative, as it makes it much clearer what data is published by various public bodies.

There is room for improvement, and many things are still missing. But I have found that two shortcomings most commonly frustrate efforts to consume this data meaningfully, and many problems would be addressed by turning the following two requirements into guiding principles:

1. Always publish the company registration number when referring to a company registered in the UK.

2. Make datasets available for bulk download, not just on-demand search.

These criteria are essential for cross-referencing records and offline data linking, yet very few gov.uk datasets meet them. For example, some crucial publications, such as fines, public contracts or Environment Agency enforcement actions, lack company registration numbers.

While listing a company name is often sufficient for investigative purposes and human-driven research, it's sadly unreliable for automated data linking. URIs would be ideal, but at a minimum, a registration number would be of great help when cross-referencing these records.
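To illustrate why names alone are fragile, here is a minimal sketch in Python. All records, company numbers and function names are made up for the example. It assumes a company that was fined under one name and later renamed, while a different company now carries the old name. Matching on a normalised name links the fine to the wrong company, whereas matching on the registration number does not.

```python
import re

# Hypothetical register snapshot: company 01234567 used to be called
# "Acme Widgets Limited" and was later renamed; 07654321 is a different
# company that now carries the old name.
REGISTER = [
    {"company_number": "01234567", "name": "ACME HOLDINGS LIMITED"},
    {"company_number": "07654321", "name": "ACME WIDGETS LIMITED"},
]

# Hypothetical fine, published under the name in use at the time.
FINE = {"name": "Acme Widgets Ltd.", "company_number": "01234567"}


def normalise_name(name):
    """Crude normalisation: lowercase, drop punctuation, unify 'limited'/'ltd'."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    cleaned = re.sub(r"\blimited\b", "ltd", cleaned)
    return " ".join(cleaned.split())


def match_by_name(record, register):
    """Return the numbers of all companies whose normalised name matches."""
    key = normalise_name(record["name"])
    return [r["company_number"] for r in register
            if normalise_name(r["name"]) == key]


def match_by_number(record, register):
    """Return the numbers of all companies with the same registration number."""
    return [r["company_number"] for r in register
            if r["company_number"] == record["company_number"]]


print(match_by_name(FINE, REGISTER))    # links to the wrong company
print(match_by_number(FINE, REGISTER))  # links to the company that was fined
```

More elaborate fuzzy matching can reduce missed matches, but it cannot resolve this kind of ambiguity. Only a stable identifier can.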

As for making the data available in bulk, this matters from a technical point of view. When implementing solutions that consume this data, it's often impractical to make an API request to an external service every time details on an entity are needed. And when on-demand search is the only way in, performing offline record linking based on custom (and often not yet defined) requirements is near impossible.
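As a rough sketch of what bulk availability makes possible, the Python snippet below joins two datasets entirely offline. The CSV contents, column names and function names are assumptions made for the example, not the layout of any real gov.uk file. The register is loaded once into an in-memory index keyed by registration number, and every enforcement record is then resolved with a dictionary lookup rather than a network request.

```python
import csv
import io

# Hypothetical bulk extracts, inlined as strings to keep the example
# self-contained. In practice these would be downloaded files.
BULK_REGISTER_CSV = """company_number,company_name,status
01234567,ACME HOLDINGS LIMITED,Active
07654321,ACME WIDGETS LIMITED,Dissolved
"""

ENFORCEMENT_CSV = """company_number,action,year
01234567,Fine,2019
07654321,Warning,2020
01234567,Fine,2021
00000001,Fine,2021
"""


def load_index(csv_text, key):
    """Read a CSV and index its rows by the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def link(actions_csv, register_index):
    """Attach register details to each action; collect rows with no match."""
    linked, unmatched = [], []
    for row in csv.DictReader(io.StringIO(actions_csv)):
        company = register_index.get(row["company_number"])
        if company is None:
            unmatched.append(row)
        else:
            linked.append({**row,
                           "company_name": company["company_name"],
                           "status": company["status"]})
    return linked, unmatched


register_index = load_index(BULK_REGISTER_CSV, "company_number")
linked, unmatched = link(ENFORCEMENT_CSV, register_index)
print(len(linked), "linked,", len(unmatched), "unmatched")
```

Because both datasets are local, the linking rules can be changed and re-run as often as needed, which is exactly what a search-only service prevents.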

Publishing bulk datasets is understandably more challenging, especially since many of these services seem to be designed with a search-first mindset. But I believe that if Companies House managed to do this with a dataset of millions of records, then it should be even more attainable for agencies that oversee substantially less data.

To drive the argument home, there are matters of public interest and questions which I believe we should be able to answer with minimal effort. For example, does this individual control any companies that broke money laundering regulations, and how are these companies connected to other entities? What other assets are controlled by an individual whose companies have applied for R&D tax reliefs? Are there any political donations made by companies controlled by foreign nationals?

To end on a positive note, I've been interacting with these services on and off for years, and I have seen substantial and consistent progress. I hope to see more of this, and I hope to see more care and consideration for enabling automated systems to consume this data meaningfully and in ways we haven't imagined yet.