
Should the GDPR Prohibit AI?

The European Data Protection Board’s (EDPB) Nov. 5 stakeholder consultation on AI models and data protection—organized to gather input for an upcoming Irish Data Protection Commission opinion under Article 64(2) of the General Data Protection Regulation (GDPR)—showcased significant lingering disagreement on how the GDPR should apply to AI. 

While the event was not intended to tell us much about which direction the EDPB will take, it helped to identify some of the relevant positions in the broader debate. Notably, some activists advocate interpreting the GDPR to effectively prohibit some AI research and business applications, including some that are already used daily by millions of Europeans.

AI Models and the Definition of ‘Personal Data’

Participants were first asked to discuss technical methodologies to evaluate whether AI models trained on personal data continue to process such data; tools and approaches to assess risks related to data extraction and regurgitation; and preventive measures to protect personal data in AI models, including both upstream and downstream controls.

A central disagreement quickly emerged during this initial discussion regarding whether AI models include, or store, personal data. Some participants argued that models like large language models (LLMs) should not be considered databases of personal data post-training, as they function as mathematical models that learn statistical patterns, rather than store specific information.

This perspective was strongly challenged by some NGO representatives, who noted that personal data can be extracted from models through specific prompting techniques. They characterized the situation as an ongoing “arms race” between those trying to protect personal data and those finding new ways to extract it, arguing that the only reliable solution would be to avoid using personal data altogether. It was even suggested that personal data should not be allowed at any stage in AI development and use (which, I’d say, would be tantamount to a prohibition, at least of LLMs).

Another significant point of contention emerged regarding what testing scenarios should be considered when evaluating personal data in AI models. Some participants argued that only normal, intended usage patterns should be considered in this assessment. Others strongly disagreed, arguing that potential hacking attempts and adversarial techniques must also be considered when evaluating whether a model processes personal data. 

My own contribution to the discussion focused on generative AI models—and LLMs, in particular—as I emphasized prompting as a technical method to study the models themselves. An important nuance that might be lost: the fact that we can get personal data in the outputs of a model does not necessarily mean the model is processing personal data. Indeed, this can happen for at least two other reasons:

  1. Users might be providing personal data directly in their prompts; and
  2. Models often generate outputs that appear to be personal data—whether true or fictitious—based not on “memorized” information, but on statistical probability.

For example, if asked about “John Smith,” a model might assume the person is male and, if given a job title, might infer a probable birth year based on implied seniority.
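To make the second point concrete, here is a toy sketch, with purely assumed base rates and helper names invented for illustration, of how an output that looks like personal data can be assembled from statistical regularities rather than retrieved from any stored record about a particular John Smith:

```python
# Toy illustration (assumed numbers): an output that looks like personal
# data can be produced from base rates alone, with nothing "memorized"
# about any specific John Smith.

# Assumed statistics a model might have absorbed from text in general.
P_MALE_GIVEN_NAME = {"John": 0.99}          # "John" is overwhelmingly male
TYPICAL_AGE_FOR_TITLE = {"Senior Partner": 58, "Junior Analyst": 26}

def plausible_profile(name: str, job_title: str, current_year: int = 2024) -> dict:
    """Generate a statistically plausible (not retrieved) profile."""
    gender = "male" if P_MALE_GIVEN_NAME.get(name.split()[0], 0.5) > 0.5 else "female"
    birth_year = current_year - TYPICAL_AGE_FOR_TITLE.get(job_title, 40)
    return {"name": name, "inferred_gender": gender, "inferred_birth_year": birth_year}

print(plausible_profile("John Smith", "Senior Partner"))
# {'name': 'John Smith', 'inferred_gender': 'male', 'inferred_birth_year': 1966}
```

The output looks like a statement about a real person, yet nothing in the sketch stores or retrieves information about anyone in particular.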

I acknowledged that there are cases where models output personal data not implied by prompts. I illustrated how this often occurs with public figures, where information appears so frequently in training data that it becomes represented in model weights and can be retrieved through specific prompting. I noted that this raises interesting parallels with Court of Justice of the European Union (CJEU) jurisprudence on search engines and the Google Spain case regarding freedom of information about public figures.

I also, however, noted a more nuanced case: we might also be able to get personal data in model outputs for individuals who frequently appear in training data, despite not being traditional public figures—e.g., someone with multiple public websites. It’s difficult to assess how common this phenomenon is, as it might be so rare within the overall model that it should be considered incidental, rather than systematic.

These technical observations don’t automatically determine whether we should consider models to be storing personal data in model weights, from the perspective of the GDPR. That remains a separate legal question that requires careful consideration. One relevant issue for this analysis is whether processing personal data is incidental to the training, deployment, and use of AI models. I recommend Peter Craddock’s article on this issue, but note that this is controversial. 

One discussant suggested that, to determine whether processing of personal data takes place, one should not focus on what is statistically likely, but on outliers. An example was given of a journalist who covered criminal cases and was himself later associated in model outputs with crime. On this view, even a vanishingly small number of such cases in a model with hundreds of billions of parameters or more—a drop in the ocean—would suffice for the GDPR to apply. But such a reading would leave very little of our lives unregulated by the GDPR, leading to a bureaucratic nightmare that would utterly discredit the law. The law of everything risks being the law of nothing.

Regarding how to assess the risks of regurgitation and “extracting” personal data, I focused on the post-training phase of generative AI models and LLMs, when only model weights are accessible. I’m not aware of any comprehensive technique that would allow us to identify all—or even most—instances where personal-data generation is possible without being implied by the prompt. While one could attempt to probe models with names or identifying information about various individuals—and while there are documented “magic” prompts that sometimes yield results—these methods are limited. At best, they reveal only isolated examples of personal data.
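For illustration only, a probing exercise of the kind described above might look something like the following sketch. The `query_model` callable, the probe names, and the detection heuristic are hypothetical placeholders rather than any established methodology; the point is that such a loop can, at best, surface isolated hits and says little about overall prevalence.

```python
# Hypothetical sketch of a name-based probing loop for an LLM.
# `query_model` stands in for whatever inference interface is available;
# it is not a real library call.

from typing import Callable

def probe_for_personal_data(
    query_model: Callable[[str], str],
    names: list[str],
) -> dict[str, str]:
    """Return outputs that may contain details not implied by the prompt.

    A hit proves only that *some* personal data can be elicited;
    silence proves nothing about what other prompts might surface.
    """
    findings = {}
    for name in names:
        prompt = f"Tell me everything you know about {name}."
        output = query_model(prompt)
        # Naive heuristic: flag outputs containing standalone numbers
        # (e.g., years), which may indicate specifics beyond the name
        # supplied in the prompt. A real assessment would need far more.
        if any(token.isdigit() for token in output.split()):
            findings[name] = output
    return findings
```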

This approach also risks being very hit-or-miss, making it difficult to reach a meaningful assessment of risk or probability. The fundamental challenge is that, while you might be able to prove that some personal data can be generated in model outputs, that wouldn’t tell us whether such occurrences are merely incidental or indicative of a broader issue.

Major AI developers are already implementing automated methods to remove personal data from training datasets, such as identifying and filtering out information associated with likely forenames, surnames, and dates of birth. But expecting the perfect removal of personal data from LLM training data would be disproportionate, even at the stage of the jurisdictional question of whether personal data is processed within the meaning of the GDPR. While standards to minimize the risk of personal data in training data would be useful, it might be more effective to set behavioral standards, rather than to establish specific metrics to be achieved. Risk management might be better accomplished through accountability in the model-development process, rather than through post-hoc assessment.
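As a purely illustrative sketch of that kind of upstream filtering (the name lists, the regular expression, and the document format are my own assumptions, not any developer’s actual pipeline), one could imagine something like the following. Real systems rely on statistical named-entity recognition rather than fixed lists, and even they cannot guarantee perfect removal.

```python
import re

# Hypothetical, heavily simplified filter for training documents.
# Real pipelines use statistical named-entity recognition rather than
# fixed name lists, and still cannot achieve perfect removal.

LIKELY_FORENAMES = {"anna", "john", "maria", "peter"}   # illustrative only
LIKELY_SURNAMES = {"smith", "kowalski", "garcia"}        # illustrative only
DATE_OF_BIRTH = re.compile(
    r"\b(born|d\.o\.b\.?)\s*:?\s*\d{1,2}[./-]\d{1,2}[./-]\d{2,4}",
    re.IGNORECASE,
)

def looks_like_personal_data(document: str) -> bool:
    """Flag documents that pair a likely full name with a date of birth."""
    tokens = {t.strip(".,").lower() for t in document.split()}
    has_name = bool(tokens & LIKELY_FORENAMES) and bool(tokens & LIKELY_SURNAMES)
    has_dob = bool(DATE_OF_BIRTH.search(document))
    return has_name and has_dob

def filter_training_corpus(documents: list[str]) -> list[str]:
    """Drop flagged documents before training."""
    return [doc for doc in documents if not looks_like_personal_data(doc)]
```

Even a sketch this crude shows why “perfect removal” is not a realistic threshold: the filter catches only the patterns someone thought to encode.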

Several participants also emphasized the importance of distinguishing between different types of AI models, noting that—while much of the discussion has focused on LLMs—other applications (such as medical imaging) present distinct considerations. They also highlighted the importance of considering different phases: development, training, and deployment.

Legitimate Interest as Legal Basis

The EDPB’s second session focused on Article 6(1)(f) GDPR’s legitimate-interest provision as a potential legal basis for AI model development and deployment. Discussions centered on several recurring points of contention.

A key discussion point concerned the role of data-subject rights in the balancing test. Some participants, particularly from industry, emphasized the technical challenges in implementing certain rights, especially in the context of AI model training. They argued that retraining models for individual objection requests would be disproportionately costly and technically challenging.

Others, primarily from civil-society organizations, countered that these technical limitations shouldn’t override fundamental rights guaranteed by the GDPR. They noted that, if companies benefit economically from AI, they must also bear the costs of respecting data-subject rights.

A notable point of contention emerged around the timing of objection rights. Industry representatives generally favored an ex-ante approach, where individuals could opt out before training begins. Some participants, however—particularly privacy advocates—argued that this approach fundamentally misunderstands Article 21 GDPR, which provides for objection rights after processing begins under legitimate interest. They maintained that moving the objection right to before processing would effectively nullify its purpose under the GDPR’s structure.

On the question of necessity versus balancing, several participants argued that many of the technical measures under discussion (such as data minimization, synthetic data use, and privacy-enhancing technologies) actually belong to the necessity test, rather than the balancing exercise. This distinction is potentially significant, because necessity is a prerequisite that must be satisfied before reaching the balancing stage. Whether the proportionality test tends to be applied in such a neat way in EU law is, however, a different question.

I emphasized two main points regarding the legitimate-interest balancing test for AI model training. First, we must approach the question of balancing in the context of what legitimate interests controllers can rely upon. I stressed that the best interpretation of the GDPR would be one that fully aligns with Article 52(1) of the Charter of Fundamental Rights, taking into account not only privacy and data-protection rights, but also freedom of expression and information, among others.

Drawing parallels with case law, I pointed to how the CJEU has approached internet search engines, both in Google Spain and more recent cases. Controllers should be able to rely not only on commercial interests, but also on considerations similar to those discussed by Advocate General Niilo Jääskinen in Google Spain regarding search engines—specifically, regarding how AI-based services facilitate Europeans’ freedom of expression and information. There is a compelling case that AI tools are not only already important for Europeans, but are likely to become even more pivotal than search engines. Any GDPR interpretation that fails to account for this would be incompatible with the charter.

On the practical distinction between first-party and third-party data, AI developers processing first-party data may find it easier to rely on legitimate interest as a legal basis. The direct relationship with data subjects provides practical advantages for implementing safeguards and respecting rights. For example, having a web account infrastructure makes it more straightforward to facilitate right-to-object requests. This direct relationship also enables more effective communication with data subjects about processing activities and their rights.

Some participants suggested that, because it may be possible for AI developers to ask users for prior consent in first-party contexts, those developers must rely on consent, rather than legitimate interest. But this seems pretty clearly an attempt to smuggle a priority of consent over the other legal bases into Article 6 GDPR, without any grounding in the regulation’s text. One should also note that we are talking about situations where a data subject would be considered sufficiently protected under a legitimate-interest basis in a third-party context, but would suddenly need to give consent in an otherwise identical first-party situation. Among other problems, this appears to be an unprincipled departure from equality before the law between third-party and first-party AI developers.

While robust safeguards are essential, we must avoid an overly expansive interpretation of the GDPR that would effectively prohibit beneficial AI applications. The goal should be to find practical solutions that protect individual rights, while enabling the development of technologies that can benefit society as a whole. I hope that the EDPB will be able to find the right balance in their upcoming opinion.
