Ian Mulvany

May 13, 2024

BMJ's response to the ICO commissioner's request about GenAI

The ICO in the UK recently issued a request for comment on GenAI and data privacy.

Together with my colleague Timothy Morgan, BMJ’s Data Protection Compliance Manager, I drafted a response from BMJ to this request. Our response is below.

Do you agree with the analysis presented in **this document**?

It’s a clearly written analysis, but in our opinion it is not complete. The analysis refers to the creation of a model, and to applications built on top of the model, but what is missing is that models can have emergent properties that are not present in any individual component of the training data. Therefore, there can be aspects of the model that do not map back directly to specific training data in the way that we might see with a simple database. Depending on the size of the training set, both the ability to recall the original training data and the propensity for emergent non-linear phenomena to appear change significantly. There is an aspect of scale here that would be worth addressing in this analysis.
For the ICO’s purposes, it would appear to be relevant if properties emerge from processing of the training data, if they relate back to individuals, or if they might be used in any way to make decisions or predictions about individuals. However, further clarity on what this looks like in practice, and on “what good looks like”, would be helpful. This would probably need to be revisited regularly (at least annually) to see how data and practices develop, and to provide frameworks for anticipating compliance requirements for novel processing.

We explain in this consultation that the purposes of generative AI model development and application development should be considered to be separate purposes. Do you agree with the analysis we have presented in **this document**?

Yes, we agree.

Where the organisation developing a model is separate to that developing the application based on it:
How can the model developer meaningfully communicate to the other organisation what personal data was used for the model training and why?

There are a number of options, but the two main ones are:

  • Make the training data available.
  • Make a description of the training data available.

What we do know is that models currently tend to be quite leaky: they memorise samples of training data and can be quite willing to reproduce them. This becomes more prominent with larger models with larger context windows. We should operate on the basis, currently, that any model that has been trained on personal data will expose a meaningful sample of that personal data if prompted in particular ways, and so the model itself should be subject to safeguards similar to those around personal data.

Do you think the purpose of developing generative AI models requires the processing of personal data?
This will depend on the application. A model trained to generate food images does not need personal data in the training set. A model trained to work on patient health records will need patient health records as a training data set. Another way to say this is that there will be some purposes that definitely require processing of personal data.

Where the model does use data that is (or could be) personal data, it would appear to be uncommon and counterproductive to generate an entire artificial data set for the specific purpose of training an AI model. That would typically create a less accurate, less usable AI tool. The ability to pseudonymise or possibly anonymise data before training might be helpful. However, that might entail using multiple systems, or trusting a single system to process that data without leaking it. As many AI tools do appear to be susceptible to becoming leaky, there is still a risk that is difficult to assess and address.
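As a rough illustration of what pseudonymising before training could look like, here is a minimal Python sketch that replaces direct identifiers with keyed hashes. The field names, the example record, and the key handling are all assumptions made for illustration, not a description of any real pipeline. Note that pseudonymised data is still personal data, and that free-text fields can remain identifying even after direct identifiers are replaced.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be held separately from the data.
SECRET_KEY = b"replace-with-a-securely-stored-key"

# Hypothetical field names, chosen only for this illustration.
IDENTIFYING_FIELDS = ("patient_id", "name")


def pseudonymise(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Return a copy of the record with identifying fields replaced by keyed hashes."""
    result = dict(record)
    for field in IDENTIFYING_FIELDS:
        if field in result:
            value = str(result[field]).encode("utf-8")
            # HMAC-SHA256 gives a stable token that cannot be recomputed without the key.
            result[field] = hmac.new(key, value, hashlib.sha256).hexdigest()
    return result


# Made-up example record.
record = {"patient_id": "12345", "name": "Jane Doe", "diagnosis": "asthma"}
safe_record = pseudonymise(record)
```

The same input always maps to the same token, so records belonging to one person can still be linked for training, while re-identification requires access to the key.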

How can organisations who use personal data to train unspecified kinds of generative AI models comply with the purpose limitation principle?
Even where data protection law has not been considered in as much depth as would be desirable, models are typically specified quite tightly, in that they are the result of a formal process of data preparation and training. In circumstances where the training steps themselves are quite generic (which we understand to be quite common), the purpose will often be defined by inference, based on the data used.

Almost all uses of personal data to create a generative AI model will require a data protection impact assessment (DPIA). As part of the DPIA, the purpose(s) will have to be specified, rather than inferred. It’s hard to envisage a situation where a DPIA won’t be required, but even if one were not, there is still a legal requirement for data protection by design and by default. One approach would be for any guidance or toolkits to refer back to necessity in relation to development. Guidelines around milestones and revisiting purpose limitation may be appropriate.

One thing to consider is that while the analysis presented separates the purpose of application building from that of model training, very often a model is trained with a specific application or use in mind.

Do you consider the collection of training data and model development to fall under the same purpose?
Not necessarily. It appears that a lot of data used for training is collected for other purposes. While there are at least some instances where data has been collected specifically to train a model, our understanding is that this is very much the minority of cases. Take, for example, the idea of training a model based on clinical records. A medical insurance company may have a legitimate purpose to collect clinical records in order to allow it to process insurance claims. Through this purpose it has built up a data set that could be used to train a model to do predictive analysis on the emergence of health trends. In this case the purposes of data collection and model development would be different.

What aspects of generative AI development and deployment would need to be documented to make a purpose specific enough?
Applying the principle of minimal privilege is likely to be helpful when it comes to personal data. Can the organisations and individuals training the model provide assurance that they are using just the data needed for their purpose, and no more than that? Is there a specific format that would be appropriate for doing so?

There would normally have to be a data protection impact assessment, where purposes should be clearly defined. As a living document, the DPIA would typically be revisited and updated as purposes narrow. Some guidelines on the use of training data as it relates to purpose limitation, and on when it should be revisited, would perhaps be helpful. While we appreciate that the ICO isn’t going to be in a position to define a specific timeframe, given the uncertainty that exists in the field of generative AI, there are likely to be milestone-related points that would trigger a reconsideration. Guidance on what to consider when determining appropriate milestones would be helpful.

How will the proposed regulatory approach affect your organisation (if at all)?
Generative AI is likely to affect all organisations to some extent. Clarity around regulatory risks, mitigations, and requirements will help us to plan accordingly. Reuse of existing personal data is likely to become more common as generative AI continues to develop, and early implementation of best practice will help organisations, including ours, prepare and use the data lawfully.

