Training artificial intelligence: how the DSM directive’s Text and Data Mining exception shaped the rules of AI Training and possible future outcomes

Author: Annalisa Ricchiuti, LL.M in Intellectual Property and Competition Law-Munich Intellectual Property Law Center (MIPLC), 2020-2021

Editor: Bobbie Smith, MA Geography University of Aberdeen 2016-2020 / Graduate Diploma in Law University of Exeter 2020-2022

 

The Digital Single Market Directive (DSMD)[1] of 2019 introduced new copyright rules, including provisions on copyright limitations for text and data mining (TDM)[2].

TDM is a technique that automates the analysis and extraction of information from large volumes of data or text. It allows patterns to be discovered in existing datasets, and those patterns in turn make nontrivial predictions on new data possible[3]. The DSMD contains two exceptions to copyright law for TDM. The first, contained in article 3, allows TDM for non-commercial research purposes, enabling researchers to extract data and information from lawfully accessible copyrighted material without seeking permission from the copyright owner. This exception is mandatory: the copyright owner cannot limit it or exclude it by contract.

The second exception, contained in article 4, provides a limitation to copyright for TDM in any other case that falls outside the scope of article 3, including commercial uses. Unlike article 3, however, article 4 allows the copyright owner to opt out of the limitation by expressly reserving the use of their works for TDM.

The directive has thus provided a new legal framework that promotes TDM while ensuring copyright owners’ rights are protected. Within this framework, TDM has been used in various industries to extract valuable information from large datasets. In recent years, AI models have been trained using TDM to recognize patterns and extract information from vast amounts of data.

One famous example of TDM being used to train AI is the development of the Natural Language Processing (NLP) model known as GPT-3. NLP is a set of computer techniques that analyze and understand human language. NLP helps computers process language in a way that’s similar to how people do it, which can be used for many different tasks and applications[4]. GPT-3 is an NLP model developed by OpenAI that uses TDM to analyze and learn from a vast corpus of text.

GPT-3 can be described as a statistical model that determines the probability distribution over a sequence of words, meaning that the system is able to predict which text comes next when certain text is given as input[5]. The language model has been trained on a diverse range of datasets, including Wikipedia, books, and scientific articles, to generate coherent and human-like responses to questions.
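To make the idea of a probability distribution over the next word more concrete, the following is a minimal sketch in Python. It is a toy bigram counter and not GPT-3’s actual method, which relies on a large neural network; the three-sentence corpus and the function names `train_bigram_model` and `next_word_distribution` are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for each word, how often each following word appears."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Pair every word with the word that immediately follows it.
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def next_word_distribution(model, word):
    """Return the estimated probability of each possible next word."""
    followers = model.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        # The word was never seen with a successor in the corpus.
        return {}
    return {candidate: n / total for candidate, n in followers.items()}


# Hypothetical three-sentence corpus, for illustration only.
corpus = [
    "the directive allows text mining",
    "the directive allows data mining",
    "the directive protects copyright owners",
]

model = train_bigram_model(corpus)
distribution = next_word_distribution(model, "directive")
most_likely = max(distribution, key=distribution.get)
print(distribution)
print(most_likely)
```

In this toy corpus the word “directive” is followed by “allows” in two of the three sentences, so the sketch treats “allows” as the most probable continuation. A model such as GPT-3 estimates such distributions from far more text and far more context, but the prediction task is of the same kind.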

ChatGPT, also developed by OpenAI, is a distinct model from GPT-3 and is designed for a different purpose. It belongs to the same family of models (OpenAI describes it as fine-tuned from a model in the GPT-3.5 series), but it differs from GPT-3 in its training data and level of performance.

ChatGPT was designed specifically for use in conversational agents or chatbots. According to one comparison, it was trained on a smaller amount of text data than GPT-3 and is optimized for generating short responses to specific prompts or questions, rather than longer-form content[6].

Another example of TDM being used to train AI is the work of researchers at the University of Cambridge, who extracted data from over 20 million scientific articles to develop an AI tool that can predict the outcome of chemical reactions. The tool, called “Chematica”, was trained by relying on the TDM exception and has the potential to revolutionize drug discovery and development[7].

These practical examples highlight the importance of the TDM exception and how the DSMD has played a key role in accelerating scientific research, promoting innovation, and stimulating the development of new technologies and businesses.

As noted above, in the cases covered by Article 3 the TDM exception cannot be limited or contracted out by the copyright owner. TDM carried out for the purpose of scientific research therefore cannot be restricted by rightholders.

If this mandatory exception were extended to cover commercial purposes, it would have significant implications for copyright owners, who may lose control over the use of their works for commercial TDM. This could affect their revenue streams, since copyright protection is what allows them to monetize their works.

On the other hand, TDM users, such as researchers and businesses, would benefit from increased access to data and information, which could be used to develop better products and services without encountering limitations. This could foster innovation and economic growth, as well as facilitate research and development by reducing the costs related to accessing large datasets.

However, an unlimited TDM exception extended to commercial purposes could result in reduced incentives for content creators to invest in the creation of new works, which could ultimately limit the availability of data and information for TDM purposes.

The purpose of Intellectual Property Rights (IPRs) is to foster innovation by granting legal protection to the interests, primarily the economic interests, of the IPR owner. Curtailing those rights in this way could reduce investment in the production of works protected by copyright.

In addition, extending the TDM exception to commercial purposes could raise issues related to data privacy and security. TDM typically involves the processing of large volumes of data, which could contain sensitive personal information. The use of such data for commercial purposes could raise privacy concerns and lead to potential data breaches, which could have negative consequences for both individuals and businesses.

Overall, extending the TDM exception to cover commercial purposes would require careful balancing of the interests of copyright owners, TDM users, and end users. While there are potential benefits associated with increased access to data and information, it is important to ensure that such access is not achieved at the expense of copyright protection or data privacy.

Beyond the EU framework set by the DSMD, the UK government has also considered expanding the scope of unlicensed TDM activities within the country.

In 2014, the UK government introduced an exception to copyright law that allowed TDM for non-commercial research purposes[8]. However, there were concerns that the exception did not go far enough in promoting TDM in the UK.

In response to these concerns, the UK government launched a consultation[9] to consider broadening the scope for unlicensed TDM activities. The government’s initial plan was to introduce a new exception to copyright law that would allow TDM for any purpose, including commercial TDM. The UK Government considered that some rights holders are willing to license their works to allow TDM, but others are not[10]. This has financial costs for people using data mining software. The new exception would have been subject to a lawful access requirement to the relevant copyright works and other protected subject matter and would have made the UK one of the most permissive jurisdictions in the world for TDM. The government’s consultation received broad support from a range of stakeholders, including the scientific community, industry groups, and copyright owners.

However, the government ultimately decided not to introduce a new exception for TDM[11].

Despite this, the UK government’s initial plan to broaden the scope for unlicensed TDM activities highlights the potential benefits of TDM for the UK’s economy and research communities. The government’s consultation showed that there is a clear demand for a more permissive legal framework for TDM, and it is possible that the government may revisit this issue in the future.

In short, although the UK government did not amend UK copyright law to introduce a new TDM exception, its initial plan to consider a more permissive legal framework highlights the potential benefits of this technology for the UK’s digital economy and research communities. The provisions of the DSMD provide a clear legal framework for non-commercial TDM in the EU, and it remains possible that the UK government will revisit the question of broadening the scope of unlicensed TDM activities.

On the other hand, as emerged from a recently published House of Lords report on the matter[12], the proposal of the Intellectual Property Office (IPO) generated significant concern within the creative industries about the potential loss of revenue, and was criticised for failing to take sufficient account of the potential harm to those industries[13].

Balancing IPRs against the interests of other parties, such as researchers, businesses, and society as a whole, requires the attentive and cautious intervention of governments and national parliaments, as well as the involvement of the relevant stakeholders and market players in the discussion.

The debate on TDM in the UK is a great example of how finding the perfect balance between different needs requires time: the discussion in the UK is still open and it will be interesting to see how it will evolve in the future.

 

List of references

[1] Digital Single Market Directive, [2019] OJ L130/1.

[2] Digital Single Market Directive, [2019] OJ L130/1, art 3 and 4.

[3] Ian H Witten and others, Data Mining (4th edn, Morgan Kaufmann 2017).

[4] Elizabeth D Liddy, ‘Natural Language Processing’ in Encyclopedia of Library and Information Science (2nd edn, Marcel Dekker 2001).

[5] Steve Tingiris and Bret Kinsella, Exploring GPT-3 (Packt Publishing 2021), Chapter 1.

[6] Merunas Grincalaitis, ‘ChatGPT vs GPT-3 by Merunas’ (medium.com, 12 December 2022) <https://merunasgrincalaitis.medium.com/chatgpt-vs-gpt-3-by-merun-4c282c83d50a> accessed 15 February 2023.

[7] T Klucznik, B Mikulak-Klucznik, M McCormack and others, ‘Efficient Syntheses of Diverse, Medicinally Relevant Targets Planned by Computer and Executed in the Laboratory’ (2018) 4 Chem, doi:10.1016/j.chempr.2018.02.002.

[8] Intellectual Property Office, ‘Exceptions to copyright’ (GOV.UK, 12 June 2014) <www.gov.uk/guidance/exceptions-to-copyright#non-commercial-research-and-private-study> accessed 28 February 2023.

[9] Intellectual Property Office, ‘Artificial Intelligence and Intellectual Property: Copyright and patents: Government response to the consultation’ (GOV.UK, 28 June 2022) <www.gov.uk/government/consultations/artificial-intelligence-and-ip-copyright-and-patents/outcome/artificial-intelligence-and-intellectual-property-copyright-and-patents-government-response-to-consultation#text-and-data-mining> accessed 28 February 2023.

[10] Ibid.

[11] UK Parliament, ‘Artificial Intelligence: Intellectual Property Rights’ (UK Parliament, 1 February 2023) <https://hansard.parliament.uk/commons/2023-02-01/debates/7CD1D4F9-7805-4CF0-9698-E28ECEFB7177/ArtificialIntelligenceIntellectualPropertyRights>.

[12] UK Parliament, ‘Chapter 2: A digital future’ (www.parliament.uk, 17 January 2023) <https://publications.parliament.uk/pa/ld5803/ldselect/ldcomm/125/12502.htm> accessed 28 February 2023.

[13] Ibid.

 

This article is written within the Academic Essay Project (AEP) organised by LAWELS. AEP aims to increase the number of quality academic writings on legal topics, encourage young lawyers to participate in academic writing, and lay the foundation of an online database on legal science. The team of legal editors and legal writers share their knowledge through high-end essays that we are publishing on our website and social media accounts for the world to read and learn from.

The articles on the LAWELS platform are not, nor are they intended to be legal advice. You should consult a lawyer for individual advice or assessment regarding your own situation. The article only reflects the views of the author.