Scraping for training generative AI models. How participants on different sides of the process should interact

Generative models require large amounts of data for training. Depending on the needs, website scraping is often used to collect data.

Scraping is a method of automated data collection from websites. It lets you gather large amounts of information quickly using dedicated services, browser extensions, applications, and standalone libraries. These tools read web pages and extract their content for later use in analytics, marketing, machine learning, or the monitoring of trends, prices, etc. This “advanced copy-paste” automates the collection and accumulation of large amounts of data with relatively little effort.
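To make the “advanced copy-paste” idea concrete, here is a minimal sketch of the extraction step using only Python's standard library. It assumes the page has already been downloaded; the sample HTML, the `TextExtractor` class, and the `extract_text` helper are illustrative names, not part of any particular scraping tool.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> list[str]:
    """Return the visible text fragments of an HTML document, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page content, inlined so the sketch needs no network access.
SAMPLE_PAGE = """
<html><head><title>Price list</title><style>p {color: red}</style></head>
<body><h1>Laptops</h1><p>Model A: 900 EUR</p><script>track()</script></body></html>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_PAGE))
```

Real scraping tools add downloading, scheduling, and storage on top of this step, which is where the legal questions discussed below arise.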

In this article, I will discuss how to scrape as much as possible while remaining legally compliant in terms of privacy, and how to build relationships between AI developers and website owners that take into account the interests of both parties.

Legal regulation

As new technologies emerge, more and more existing norms will be applied to them by analogy. Regulating a field “in advance,” before the technology exists, tends to produce rules that are easy to circumvent and laws that are ineffective.

There are no laws that directly regulate scraping.

The acts most relevant to whether and how scraping can be carried out fall into the following areas:

  • privacy;
  • intellectual property;
  • technology deployment and risks in the digital age (DMCA and CFAA);
  • AI itself (European AI Act);
  • as well as self-regulation of the AI industry.

Together, they create a “go” and “no go” framework for AI developers and even users of AI systems.

Since generative AI services became mainstream, there has been ongoing debate about the ethics of their development and about legal loopholes, which often comes down to intellectual property rights violations. The first disputes over scraping began to emerge two to three years ago, and courts are now handing down their first final decisions. They often recognise scraping as a violation, for example:

  • copyright infringement – hiQ Labs v. LinkedIn, Thomson Reuters v. Ross Intelligence, Facebook, Inc. v. Power Ventures, Inc., X Corp. v. Bright Data Ltd., NYT v. OpenAI, etc.;
  • violation of website terms of use – Canadian Legal Information Institute (“CanLII”) v. Clearway Management Ltd. and others;
  • violation of “browsewrap” and “clickwrap” agreements – X Corp. v. Bright Data Ltd.

NYT v. OpenAI is still awaiting a final decision, and the dispute in Meta Platforms, Inc. v. BrandTotal Ltd. recently ended in a confidential settlement. However different the parties’ initial claims or the final decisions may be, all these cases shape business practice and legal “tolerance” towards scraping.

Roles of participants & grounds for processing

The GDPR and the UK GDPR apply the same logic to both sides:

  • Developers are subject to EU or UK law if they are registered in the EU/UK or scrape websites that contain personal data of EU/UK residents.
  • Website owners are subject to the GDPR or UK GDPR if they process (collect, publish, etc.) data of EU or UK residents, or if they are themselves registered in the EU or UK, regardless of where their users are located.

Role of Developers

Developers will be data controllers if they independently determine the scope of personal data and the means by which they collect data for training their models. Regulators currently consider legitimate interest to be the most appropriate basis for processing. This requires developers to document their processes and conduct two assessments:

  • Data Protection Impact Assessment (DPIA) – to assess the risks of potential data processing, and
  • Legitimate Interests Assessment (LIA) – to assess the use of legitimate interest as a basis for processing.

The UK regulator, the ICO, accepts legitimate interest as a lawful basis for processing. In the EU, the EDPB (an advisory body on privacy) has published guidance that also recognises that developers can rely on legitimate interest if they can demonstrate that:

  • their interests override the interests and rights of data subjects;
  • data scraping is the only way to achieve the purpose (i.e., to collect this specific type of data for a training dataset);
  • the interest in processing is legitimate and real, not hypothetical (speculative).

However, the Dutch regulator (AP) had previously argued that legitimate interest would not be a sufficient ground, because the developers’ interest is purely commercial and would not outweigh the risks to the rights and freedoms of data subjects. For unknown reasons, the AP’s clarification is no longer available on its website. On the one hand, this may indicate a change in the regulator’s position. On the other hand, the debate does nothing to weaken the requirement to document a DPIA and an LIA in order to justify legitimate interest.

Role of website owners

Website owners usually decide for themselves how much data to display on their websites, and they are responsible for the security of those websites and, consequently, of the data itself. Therefore, they fall under the definition of a controller. They most often publish user data on the basis of a contract or a legal obligation, and less often on the basis of user consent or legitimate interest.

For the most part, website owners are not interested in scraping. Their reluctance is particularly understandable when the website contains many visual objects or other copyrighted works. However, the market is free and competitive, so website owners can meet developers halfway by making data available for scraping. The Canadian regulator (OPC) notes that this is a workable model for scraping for commercial or socially useful purposes, provided that the website owner can:

  • prove the legitimacy of the grounds for scraping;
  • transparently inform users about it;
  • obtain additional consent from people whose data will be processed in this way, if necessary.

The OPC also recommends providing scrapers with access via API, as this will allow website owners to have more control over their data.

Interests of data subjects

In the relationship between developers and website owners, ordinary users are not entirely independent participants. They usually do not suspect that their data is being scraped, and the likelihood of accidentally recognising their own data in AI results is very low. It follows that website owners unwittingly represent the interests of users, are responsible for their safety, and, even more so, for the realisation of their rights.

If website owners are interested in scraping, they must ensure a legal basis, not conceal their intentions, and inform users about it. If owners are against scraping, they will have to integrate multi-level security.

Under the GDPR, controllers and processors must implement administrative and technical security measures appropriate to the level of risk. On the administrative side, website owners can include a provision prohibiting scraping in their terms of use. Such a provision is legally binding, and violating it constitutes a breach of contract (the terms of use). In practice, however, the provision alone is not effective: it does not physically protect the website from scraping. Website owners should therefore also implement technical measures that prevent scraping, and define their terms of use clearly, which may make it easier to enforce content rights in the future. Equally important, combining these practices both reduces the risk of scraping and protects against potential fines.

In 2023, 16 regulators (Australia, Canada, the United Kingdom, China, Norway, Switzerland, New Zealand, Colombia, Morocco, Jersey, Argentina, Mexico, Spain, Guernsey, Monaco, and Israel) published a joint position on scraping and privacy protection. These regulators recognised that personal data remains protected even when it is publicly available on websites, and that the mass extraction of such data may be considered a data breach. In this context, the position paper referred to the combination of administrative and technical security measures as “multi-layered technical and procedural controls.”

Grouped by area of infrastructure security, the recommended technical measures are:

  • to use firewalls and block suspicious traffic to the website;
  • to restrict and control access to the API;
  • to restrict JavaScript on the client side;
  • to secure data in mobile applications.
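As an illustration of blocking suspicious traffic and controlling API access, the sketch below shows a simple sliding-window rate limiter. It is one possible technique, not something the regulators prescribe; the class name, the thresholds, and the use of a client identifier such as an IP address or API key are all assumptions made for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of allowed requests

    def allow(self, client_id: str, now: float) -> bool:
        """Return True if the request may proceed, False if it should be rejected.

        `now` is passed in explicitly so the logic is easy to test; a real
        service would typically pass time.monotonic().
        """
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, limiter.allow("203.0.113.7", now=t))
```

A limiter like this slows down bulk extraction without affecting ordinary visitors, but it is only one layer; determined scrapers can rotate identifiers, which is why the measures above are meant to be combined.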

The French regulator, CNIL, also advises:

  • use robots.txt or ai.txt files to signal that scraping is not permitted;
  • apply anonymisation and pseudonymisation to collected data;
  • restrict access to website content for users without an account;
  • minimise the types of data collected by the website owner in general, or filter and delete unnecessary categories immediately after collection (banking transactions, geolocation) or before publication (confidential data, data of minors), if they are not necessary.
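The minimisation and pseudonymisation advice can be sketched as a small filtering step applied to each record before it is stored or published. This is a minimal sketch under stated assumptions: the field names, the `DROP_FIELDS` and `PSEUDONYMISE_FIELDS` sets, and the choice of keyed hashing (HMAC-SHA256) are illustrative and do not come from the CNIL guidance.

```python
import hashlib
import hmac

# Hypothetical field names, chosen only for illustration.
DROP_FIELDS = {"geolocation", "bank_account"}   # deleted immediately
PSEUDONYMISE_FIELDS = {"email", "username"}     # replaced by a keyed hash


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Replace a value with a keyed hash (HMAC-SHA256, hex-encoded)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimise_record(record: dict, secret_key: bytes) -> dict:
    """Drop unnecessary fields and pseudonymise direct identifiers."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # not needed for the purpose, so never stored
        if field in PSEUDONYMISE_FIELDS:
            cleaned[field] = pseudonymise(str(value), secret_key)
        else:
            cleaned[field] = value
    return cleaned


if __name__ == "__main__":
    key = b"replace-with-a-secret-kept-separately"
    raw = {"email": "jane@example.com", "geolocation": "48.85,2.35", "comment": "Great post"}
    print(minimise_record(raw, key))
```

Note that keyed hashing is pseudonymisation, not anonymisation: as long as the key exists, the output can be linked back to a person, so the result is still personal data.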

Putting the puzzle together

In the scraping process, the responsibilities of website owners and developers almost mirror each other. If website owners have decided to restrict access to data and have set technical restrictions on the website, or if their content is protected by copyright or contains personal data, developers must understand and respect these restrictions.
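On the developer side, respecting a website's restrictions can start with checking its robots.txt rules before fetching anything. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content is inlined and the crawler name `ExampleAIBot` is hypothetical. It covers robots.txt only, since the standard library has no parser for ai.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site blocks one AI crawler entirely
# and keeps /private/ off-limits for everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for agent in ("ExampleAIBot", "OtherBot"):
        print(agent, may_fetch(rules, agent, "https://example.com/articles/1"))
```

A robots.txt check is a technical courtesy, not a legal assessment: passing it does not settle copyright, terms-of-use, or data protection questions, which still need the analysis described above.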

Moreover, obstacles at the information gathering stage complicate the documentation and operational work in which the legality of data use must be justified. Conversely, if website owners are transparent about their intention to sell information, communicate this to data subjects, and provide all opt-out options, developers get the green light. Obtaining consent at the outset not only simplifies access to data but also makes the necessary assessments and permissions easier, as there is no need to speculate about the public interest or devise a defence strategy in case of a dispute.