Cosmin Marginean

September 30, 2023

Entity identifiers in UK Public Procurement data

I wrote in the past about the importance of providing Companies House registration numbers with any public records that refer to UK-registered companies.

Contracts Finder is one dataset that contains some Companies House references within its records. As part of the BODS Risk Detection PoC, we used this dataset to link UK public contracts data with beneficial ownership records from the Open Ownership register. At the time, we identified some limitations in the availability of Companies House identifiers.

For reference, the Contracts Finder dataset uses the GB-COH namespace (GB-COH-<REGNO>) for referring to Companies House records. The presence of this identifier allows us to reliably link a public contract supplier with an entity listed in other datasets, including the Open Ownership register.
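
To make the namespace convention concrete, here is a minimal Python sketch of how a consumer might split a supplier ID into its scheme and value. The function and type names are my own, and the assumption that the scheme is always the first two dash-separated tokens is based only on the GB-COH-<REGNO> pattern described above.

```python
from typing import NamedTuple, Optional


class SupplierId(NamedTuple):
    scheme: str  # e.g. "GB-COH"
    value: str   # everything after the scheme prefix


def parse_supplier_id(raw: str) -> Optional[SupplierId]:
    """Split an ID such as 'GB-COH-00946107' into scheme and value.

    Assumption: the scheme is always the first two dash-separated tokens.
    """
    parts = raw.strip().split("-", 2)
    if len(parts) != 3 or not all(parts):
        return None
    return SupplierId(scheme=f"{parts[0]}-{parts[1]}", value=parts[2])


def companies_house_number(raw: str) -> Optional[str]:
    """Return the registration number, but only for GB-COH identifiers."""
    parsed = parse_supplier_id(raw)
    if parsed is None or parsed.scheme != "GB-COH":
        return None
    return parsed.value
```

Anything that doesn't carry the GB-COH prefix yields None here, which is exactly the case where linking to other datasets stops being straightforward.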

During the development of the PoC, we identified other concerns worth exploring, and we believe there's a case for improving the way public procurement records are handled. This analysis builds the argument for handling supplier identifiers more consistently within these records.

The Contracts Finder data

Before we look into the analysis, it's worth discussing the contents and structure of this dataset.

At the time of this research (Sept 2023), the Contracts Finder dataset refers to 308,971 public contracts awarded to various suppliers. It only captures records published from 2015 onwards - data from 2011 is also available in other formats.

A public contract can be awarded to multiple suppliers. I'm not familiar with the depths of public procurement processes, but this is presumably to cover the case for an initiative with multiple deliverables.

At a data level, this is represented through the award and supplier concepts. A public contract can have multiple awards, and an award can refer to multiple suppliers.


Fewer than 1% of the contracts (1,365 of 308,971) contain multiple award records, and only about 5% (16,306) contain awards with multiple suppliers. Even so, any model built on this data needs to account for both possibilities.
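
As a rough illustration of the contract-award-supplier relationship, the structure could be modelled along these lines. This is a sketch only: the class and field names are mine, not the Contracts Finder schema, and the IDs and names in the example are placeholders rather than real records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Supplier:
    id: str
    name: str


@dataclass
class Award:
    # An award can refer to multiple suppliers.
    suppliers: List[Supplier] = field(default_factory=list)


@dataclass
class Contract:
    # A contract can have multiple awards.
    awards: List[Award] = field(default_factory=list)

    def all_suppliers(self) -> List[Supplier]:
        """Flatten the suppliers across every award of this contract."""
        return [s for award in self.awards for s in award.suppliers]


# Placeholder data: one contract, two awards, three suppliers in total.
contract = Contract(awards=[
    Award(suppliers=[Supplier("GB-CFS-000001", "EXAMPLE ONE LTD")]),
    Award(suppliers=[
        Supplier("GB-CFS-000002", "EXAMPLE TWO LTD"),
        Supplier("GB-COH-00000000", "EXAMPLE THREE LIMITED"),
    ]),
])
```

Keeping both levels as lists, even though most contracts have a single award and a single supplier, avoids special-casing the rarer records later.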

Another aspect here is that suppliers can be awarded various contracts over the years, which raises the concern of identifying unique suppliers across multiple contracts. We'll come back to this in a bit.

Analysis lenses


These details are essential for making sense of the numbers, as there are several lenses through which we can look at this data.

Firstly, at a contract record level, we can check how many refer to suppliers with GB-COH identifiers: 79,375 out of the 308,971 records (just over 25%) fall into this category.

The small print here is that we consider a record to match this criterion if any (not necessarily all) of the suppliers listed against that contract have a GB-COH ID. We can probably discover deeper subtleties by splitting this aggregation further, but that's less material to the argument.



Given the contract-award-supplier structure discussed earlier, another way to analyse this data is by the number of awards. But because most contracts have a single award record, this doesn't dramatically affect the proportions: 82,402 of 314,264, just over 26%.

Lastly, there's the matter of unique IDs for the suppliers, where the numbers look more optimistic: 97,585 of 237,874 suppliers (about 41%) are referenced with their company registration numbers.
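
The three lenses above can be sketched as a single counting function. This is not the code used for the PoC; it is a simplified Python illustration in which a contract is a list of awards and an award is a list of supplier IDs. The "any supplier" rule for awards is my assumption, mirroring the contract-level rule, and the sample IDs are placeholders.

```python
def has_coh(supplier_id: str) -> bool:
    """True when the ID uses the Companies House namespace."""
    return supplier_id.startswith("GB-COH-")


def coverage(contracts):
    """Count GB-COH coverage per contract, per award and per unique supplier ID.

    Returns a dict of (matching, total) pairs for each lens.
    """
    # Contract lens: any supplier on any award has a GB-COH ID.
    contract_hits = sum(
        any(has_coh(s) for award in c for s in award) for c in contracts
    )
    # Award lens: any supplier on the award has a GB-COH ID (assumed rule).
    awards = [award for c in contracts for award in c]
    award_hits = sum(any(has_coh(s) for s in award) for award in awards)
    # Supplier lens: distinct IDs, regardless of where they appear.
    unique_ids = {s for award in awards for s in award}
    supplier_hits = sum(has_coh(s) for s in unique_ids)
    return {
        "contracts": (contract_hits, len(contracts)),
        "awards": (award_hits, len(awards)),
        "suppliers": (supplier_hits, len(unique_ids)),
    }


# Placeholder data: three contracts, four awards, three distinct supplier IDs.
sample = [
    [["GB-COH-00000001"]],
    [["GB-CFS-1"], ["GB-CFS-2", "GB-COH-00000001"]],
    [["GB-CFS-1"]],
]
result = coverage(sample)
```

On this toy sample the three lenses give 2 of 3 contracts, 2 of 4 awards and 1 of 3 unique IDs, which shows how the same data can produce quite different proportions depending on the lens.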

"Unique" supplier IDs


Many of the records in Contracts Finder use a GB-CFS namespace for the supplier IDs. 

{"id":"GB-CFS-186058","name":"CFH DOCMAIL LTD"}
{"id":"GB-CFS-161004","name":"ORTUS ECONOMIC RESEARCH LTD"}

It's unclear how (or if at all) these can be dereferenced for additional details, but it's worth noting that there are many inconsistencies across the dataset for these identifiers. 

There seems to be some re-use of these IDs when the same entity is awarded a new contract. But we've encountered many examples where (what seems to be) the same entity is referenced with different GB-CFS identifiers.

{"id":"GB-CFS-183074","name":"10X GENOMICS"}
{"id":"GB-CFS-137874","name":"10X GENOMICS"}
{"id":"GB-CFS-85773","name":"10X GENOMICS INC"}
{"id":"GB-CFS-236315","name":"10X GENOMICS INC"}
{"id":"GB-CFS-173034","name":"10X GENOMICS INC"}
{"id":"GB-CFS-113766","name":"10X GENOMICS INC"}
{"id":"GB-CFS-155809","name":"10X GENOMICS INC"}

Even worse, there are entities which are listed with GB-CFS identifiers in some records and GB-COH in others, which complicates the matter further.

{"id":"GB-CFS-83749","name":"BIFFA WASTE SERVICES LIMITED"}
{"id":"GB-COH-00946107","name":"BIFFA WASTE SERVICES LIMITED"}

In short, identifying suppliers who were awarded multiple contracts can't be done reliably. It's entirely possible to end up with two supplier records for the same company and an inaccurate picture of its contracts. This adds another layer of complexity to what's already an unpleasant problem.
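
One way to surface these inconsistencies is to group supplier records by name and list the names that appear under more than one ID. The Python sketch below does this with the records shown above. Matching on an upper-cased, whitespace-normalised name is my own simplification, and it is a heuristic at best: two different entities can share a name, and one entity can appear under several spellings.

```python
from collections import defaultdict

# Supplier records taken from the examples in this section.
suppliers = [
    {"id": "GB-CFS-183074", "name": "10X GENOMICS"},
    {"id": "GB-CFS-137874", "name": "10X GENOMICS"},
    {"id": "GB-CFS-85773", "name": "10X GENOMICS INC"},
    {"id": "GB-CFS-236315", "name": "10X GENOMICS INC"},
    {"id": "GB-CFS-83749", "name": "BIFFA WASTE SERVICES LIMITED"},
    {"id": "GB-COH-00946107", "name": "BIFFA WASTE SERVICES LIMITED"},
    {"id": "GB-CFS-186058", "name": "CFH DOCMAIL LTD"},
]


def ids_by_name(records):
    """Return names that are associated with more than one supplier ID."""
    groups = defaultdict(set)
    for record in records:
        # Normalise case and whitespace only; no fuzzy matching.
        key = " ".join(record["name"].upper().split())
        groups[key].add(record["id"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


def mixed_scheme_names(records):
    """Names that appear with both GB-CFS and GB-COH identifiers."""
    return sorted(
        name for name, ids in ids_by_name(records).items()
        if any(i.startswith("GB-COH-") for i in ids)
        and any(i.startswith("GB-CFS-") for i in ids)
    )


duplicates = ids_by_name(suppliers)
```

Note that exact-name grouping keeps "10X GENOMICS" and "10X GENOMICS INC" apart, so even this check understates the problem.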

Context/Namespace

It's important to know that many types of entities (like trusts, charities, local authorities, etc.) can be awarded public contracts. These don't have Companies House registration numbers, and it's unclear at this stage how many of them fall into the 59% of suppliers without a GB-COH ID.

It's often possible to infer the entity type from the supplier's name if, for example, it contains words like "TRUST", "CHARITY", etc. This has some value in manual investigative efforts, but it's not a reliable approach for automated data linking.
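
For illustration, a keyword-based guess might look like the Python sketch below. The keyword lists and category labels are my own assumptions, not anything defined by Contracts Finder, and the example mostly serves to show why this is fragile.

```python
import re
from typing import Optional

# Assumed keywords per entity type; checked in this order, first match wins.
ENTITY_KEYWORDS = {
    "charity": ("CHARITY", "CHARITABLE"),
    "trust": ("TRUST",),
    "local authority": ("COUNCIL",),
    "company": ("LTD", "LIMITED", "PLC"),
}


def guess_entity_type(name: str) -> Optional[str]:
    """Guess an entity type from whole words in the supplier name."""
    tokens = set(re.findall(r"[A-Z0-9]+", name.upper()))
    for entity_type, keywords in ENTITY_KEYWORDS.items():
        if tokens.intersection(keywords):
            return entity_type
    return None  # No keyword present: the name tells us nothing.
```

A name like "10X GENOMICS" returns None, and a name containing both "TRUST" and "LIMITED" is labelled by whichever keyword is checked first, so the output is a hint for a human reviewer rather than a basis for linking records.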

More importantly, with the exception of GB-COH identifiers, the Contracts Finder records contain no information about the entity type, or the way it can be dereferenced.

{
    "id": "GB-CFS-244478",
    "name": "Alexandra Rose charity",
    "identifier": {
        "legalName": "Alexandra Rose charity"
    },
    "address": {
        "streetAddress": "BN1 3XG"
    },
    "details": {
        "scale": "sme",
        "vcse": false
    },
    "roles": [
        "supplier"
    ]
}

For example, suppose a charity number is indeed present as a supplier ID (maybe prefixed by GB-CFS). The record would still be missing essential context for the consumer to determine that this ID can be matched against the Charities register - or any other register, for that matter.

This can probably be addressed with a simple namespacing approach for producing these IDs. It does, however, require either a) abandoning the internal GB-CFS identifiers altogether or b) maintaining some mapping between those and the external references. The latter isn't ideal, but it's workable if this mapping is public.
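
A minimal Python sketch of what this could look like for a consumer is shown below. Everything in it is an assumption for illustration: the register table, the use of GB-CHC as a prefix for charity numbers, and the mapping entry (which uses placeholder values, not real identifiers).

```python
from typing import Dict, Optional, Tuple

# Assumed namespace prefixes and the registers they would point to.
REGISTERS: Dict[str, str] = {
    "GB-COH": "Companies House",
    "GB-CHC": "Charity Commission for England and Wales",
}

# Option b: a hypothetical public mapping from internal GB-CFS IDs
# to namespaced external references. Placeholder values only.
CFS_TO_EXTERNAL: Dict[str, str] = {
    "GB-CFS-000001": "GB-CHC-0000000",
}


def resolve(supplier_id: str) -> Optional[Tuple[str, str]]:
    """Return (register name, register ID) when the ID can be dereferenced."""
    # Translate internal IDs first; namespaced IDs pass through unchanged.
    external = CFS_TO_EXTERNAL.get(supplier_id, supplier_id)
    scheme, _, value = external.rpartition("-")
    if scheme not in REGISTERS or not value:
        return None
    return REGISTERS[scheme], value
```

With option a, the mapping table disappears and the supplier ID itself carries the namespace. With option b, an unmapped GB-CFS ID still resolves to nothing, which is why the mapping would need to be both public and complete to be useful.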

In conclusion, there are at least three key points to the discussion around Contracts Finder identifiers:
  • Consistent use of Companies House IDs where applicable
  • Consistency of supplier IDs across records
  • Context for dereferencing the supplier

These are all equally important in addressing this gap, particularly in an age of large-scale data processing and automated record linking. Augmenting public records with external references will be essential for delivering against the principles of public sector transparency.