BMJ's response to the ICO's request for comment on GenAI

The ICO in the UK recently issued a request for comment on GenAI and data privacy.

Together with my colleague Timothy Morgan, BMJ’s Data Protection Compliance Manager, I drafted a response from BMJ to this request. Our response is below.

Do you agree with the analysis presented in **this document**?

It’s a clearly written analysis but, in our opinion, it is incomplete. The analysis refers to the creation of a model, and to applications built on top of the model, but it does not address the fact that models can have emergent properties that are not present in any individual component of the training data. There can therefore be aspects of the model that do not map back directly to specific training data in the way they would in a simple database. Both the model’s ability to recall the original training data and the propensity for emergent nonlinear phenomena to appear change significantly with the size of the training set. This question of scale needs to be addressed, and it would be worth approaching it through this analysis.

For the ICO’s purposes, it would appear to be relevant whether properties that emerge from processing the training data relate back to individuals, or might be used in any way to make decisions or predictions about individuals. However, further clarity on what this looks like in practice, and on “what good looks like”, would be helpful. This would probably need to be revisited regularly (at least annually) to see how data and practices develop, and to provide frameworks for anticipating compliance requirements for novel processing.

We explain in this consultation that the purposes of generative AI model development and application development should be considered to be separate purposes. Do you agree with the analysis we have presented in **this document**?

Yes, we agree.

Where the organisation developing a model is separate to that developing the application based on it:
How can the model developer meaningfully communicate to the other organisation what personal data was used for the model training and why?

There are a number of options, but the two main ones are:

Make the training data available.
Make a description of the training data available.

What we do know is that models currently tend to be quite leaky: they memorise samples of their training data and are quite willing to reproduce them. This becomes more prominent with larger models with larger context windows. For now, we should operate on the basis that any model that has been trained on personal data will expose a meaningful sample of that personal data if prompted in particular ways, and so the model itself should be subject to safeguards similar to those applied to the personal data.

Do you think the purpose of developing generative AI models requires the processing of personal data?
This will depend on the application. A model trained to generate food images does not need personal data in the training set. A model trained to work on patient health records will need patient health records as a training data set. Another way to say this is that there will be some purposes that definitely require processing of personal data.

Where the model does use data that is (or could be) personal data, it would appear to be uncommon and counterproductive to generate an entirely artificial data set for the specific purpose of training an AI model. That would typically create a less accurate, less usable AI tool. The ability to pseudonymise, or possibly anonymise, data before training might be helpful. However, that might entail using multiple systems, or trusting a single system to process that data without leaking it. As many AI tools do appear to be susceptible to leaking data, there is still a risk that is difficult to assess and address.
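
To make the pseudonymisation step mentioned above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how BMJ or anyone else processes data: the field names, the example record and the helper names `pseudonymise` and `pseudonymise_record` are all hypothetical. It replaces direct identifiers with a keyed hash (HMAC-SHA256 from the standard library) before a record goes anywhere near a training pipeline.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Return a stable keyed-hash token for a direct identifier."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_record(record: dict, identifier_fields: set, secret_key: bytes) -> dict:
    """Replace the listed identifier fields with tokens; leave other fields as-is."""
    return {
        field: pseudonymise(str(value), secret_key) if field in identifier_fields else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    # Hypothetical record and field names, for illustration only.
    key = b"keep-this-key-separate-from-the-training-data"
    record = {"name": "Jane Doe", "patient_id": "A-1001", "diagnosis": "asthma"}
    print(pseudonymise_record(record, {"name", "patient_id"}, key))
```

Note the limits of this approach. Whoever holds the key can regenerate the token for any known identifier and so link records back to individuals, which is why the result is pseudonymised rather than anonymised and remains personal data. It also does nothing about identifying details that sit in free-text fields, which is where the leakage risk described above is hardest to assess.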

How can organisations who use personal data to train unspecified kinds of generative AI models comply with the purpose limitation principle?
Even where data protection law has not been considered in as much depth as would be desirable, models are typically specified quite tightly, in that they are the result of a formal process of data preparation and training. In circumstances where the training steps themselves are quite generic (which we understand to be quite common), the purpose will often be defined by inference, based on the data used.

Almost all uses of personal data to create a generative AI model will require a data protection impact assessment (DPIA). As part of the DPIA, the purpose(s) will have to be specified, rather than inferred. It’s hard to envisage a situation where a DPIA won’t be required, but even if one weren’t, there is still a legal requirement for data protection by design and by default. One approach would be for any guidance or toolkits to refer back to necessity in relation to model development. Guidelines around milestones and revisiting purpose limitation may be appropriate.

One thing to consider is that, while the analysis presented separates the purpose of application building from that of model training, very often a model is trained with a specific application or use in mind.

Do you consider the collection of training data and model development to fall under the same purpose?
Not necessarily. It appears that a lot of data used for training is collected for other purposes. While there are at least some instances where data has been collected specifically to train a model, our understanding is that it’s very much the minority of cases. Let us take for example the idea of training a model based on clinical records. A medical insurance company may have a legitimate purpose to collect clinical records in order to allow it to process insurance claims. It has via this purpose built up a data set that could be used to train a model to do predictive analysis on emergence of health trends. In this case the purposes of data collection and model development would be different.

What aspects of generative AI development and deployment would need to be documented to make a purpose specific enough?
Applying the principle of least privilege, or data minimisation, is likely to be helpful when it comes to personal data. Can the organisations and individuals training the model provide assurance that they are using just the data needed for their purpose, and no more than that? Is there a specific format that would be appropriate for doing so?
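
One way to picture “just the data needed for their purpose, and no more” is an explicit allow-list of fields, applied before training and recorded alongside the purpose. The following Python sketch is purely illustrative: the function name `minimise_record`, the field names and the allow-list are hypothetical, and this is not a format the ICO has proposed.

```python
def minimise_record(record: dict, allowed_fields: set) -> tuple:
    """Keep only allow-listed fields and report which fields were dropped."""
    kept = {field: value for field, value in record.items() if field in allowed_fields}
    dropped = sorted(set(record) - set(allowed_fields))
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical allow-list for a hypothetical purpose: predicting health trends.
    allowed = {"age_band", "diagnosis", "region"}
    record = {
        "name": "Jane Doe",
        "age_band": "40-49",
        "diagnosis": "asthma",
        "region": "London",
        "phone": "000",
    }
    kept, dropped = minimise_record(record, allowed)
    print("kept:", kept)
    print("dropped:", dropped)
```

The list of dropped fields is returned so that it can be logged, which gives some evidence for the kind of assurance the question above asks about.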

There would normally have to be a data protection impact assessment, where purposes should be clearly defined. As a living document, the DPIA would typically be revisited and updated as purposes narrow. Guidelines on the use of training data as it relates to purpose limitation, and on when that use should be revisited, would be helpful. While we appreciate that the ICO isn’t going to be in a position to define a specific timeframe, given the uncertainty that exists in the field of generative AI, there are likely to be milestone-related points that would trigger a reconsideration. Guidance on what to consider when determining appropriate milestones would also be helpful.

How will the proposed regulatory approach affect your organisation (if at all)?
Generative AI is likely to affect all organisations to some extent. Clarity around regulatory risks, mitigations, and requirements will help us to plan accordingly. Reuse of existing personal data is likely to become more common as generative AI continues to develop, and early implementation of best practice will help organisations, including ours, prepare and use the data lawfully.

Tags from OpenAI:

data privacy, artificial intelligence, genai, legal compliance

Chinese Summary: 英国ICO最近就GenAI和数据隐私征求意见。BMJ团队就此进行了回应，他们认为模型可能存在并未在任何个别训练数据中出现的突现属性。模型开发和应用开发应被视为两个不同的目的。对于使用个人数据的模型，应对模型本身施加相似的保护措施。如果模型使用个人数据，那么生成整个用于培训AI模型的人工数据集似乎是不常见且无效的。组织将需要进行数据保护影响评估(DPIA)，并必须明确指出目的，而不是推断出来。在大多数情况下，训练数据的收集和模型开发可能并非为同一目的。

German Summary: Das ICO in Großbritannien hat kürzlich um eine Stellungnahme zu GenAI und Datenschutz gebeten. Das BMJ-Team hat darauf reagiert und betont, dass Modelle Eigenschaften aufweisen können, die in keiner einzelnen Komponente der Trainingsdaten vorhanden sind. Die Entwicklung von Modellen und Anwendungen sollte als separate Zwecke betrachtet werden. Modelle, die persönliche Daten verwenden, sollten ähnlichen Sicherheitsvorschriften unterliegen. Wenn die Trainingsdaten persönliche Daten verwenden, ist die Erzeugung eines gesamten künstlichen Datensatzes für das Training oftmals nicht üblich und kontraproduktiv. Organisationen müssen eine Datenschutz-Folgenabschätzung (DPIA) durchführen und die Zwecke müssen spezifisch angegeben werden, nicht inferiert. Die Sammlung von Trainingsdaten und die Entwicklung von Modellen fallen nicht unbedingt unter den gleichen Zweck.

Spanish Summary: El ICO en el Reino Unido recientemente solicitó comentarios sobre GenAI y la privacidad de datos. El equipo de BMJ respondió a esta solicitud, destacando que los modelos pueden tener propiedades emergentes que no están presentes en ningún componente individual de los datos de entrenamiento. Debe considerarse que el desarrollo y la aplicación del modelo son propósitos separados. Los modelos que usan datos personales deben tener las mismas salvaguardas que los datos personales. Si el modelo utiliza datos que son (o pueden ser) datos personales, sería poco común y contraproducente generar un conjunto de datos artificiales enteros para el propósito específico de entrenar un modelo de IA. Las organizaciones necesitarán realizar una evaluación de impacto de protección de datos (DPIA) y los propósitos tendrán que ser específicos, no inferidos. La recopilación de datos de entrenamiento y el desarrollo del modelo no necesariamente caen bajo el mismo propósito.
