Training AI systems: Is it stealing or just learning by example?

In an era where Artificial Intelligence (AI) increasingly pervades creative domains, the question of how copyright laws apply to AI-generated content is becoming more pressing. As AI systems grow more sophisticated, they are capable of generating everything from text and music to visual art and even software code. Both the AI and creative sectors contribute significantly to the UK’s economy, with the creative industries generating £124.8 billion annually. This raises crucial legal questions about how intellectual property rights will be protected as AI adoption grows. 

There are three key issues and challenges currently facing AI and copyright:

  1. Ownership rights
  2. Authorship
  3. Training AI 

This article focuses solely on Training AI.

Training AI on copyrighted material

AI systems like ChatGPT are trained on vast datasets that often include copyrighted material, some of which is scraped from the internet without the permission of the copyright owners, in violation of copyright law. But how can AI learn and be trained if there are limitations on the training data that can be used? Unlike trade marks, there is no register of copyrighted works, so it can be difficult for AI developers to distinguish between copyrighted works and works they are free to use. The current legal framework lacks clear guidance on this issue and arguably fails to fairly balance the interests of AI developers and rightsholders. Copyright owners struggle to enforce their rights and secure fair compensation for the use of their works by AI systems, while AI developers face legal uncertainty over what data they can and cannot use, which ultimately hinders innovation. As a result, AI developers often train their models in jurisdictions with clearer rules, which limits AI investment and opportunities in the UK while still leaving them exposed to claims that their unlicensed use of copyrighted materials infringes UK copyright law. 

Case study – Getty Images v Stability AI

In January 2023, Getty Images commenced proceedings in the High Court of Justice against Stability AI. Getty Images has accused Stability AI of scraping 12 million copyrighted images, including over 50,000 images that were exclusively licensed to Getty Images, along with associated metadata, from its platform without a licence or authorisation, to train its AI model ‘Stable Diffusion’.

High Court of Justice – Getty Images v Stability AI

Getty Images’ claims are:

  1. Direct copyright infringement pursuant to section 16 of the Copyright, Designs and Patents Act 1988 (CDPA) – Stability AI downloaded and used copyrighted works to train Stable Diffusion without a licence or authorisation. 
  2. Secondary copyright infringement pursuant to section 22 of the CDPA – Stability AI imported the pre-trained Stable Diffusion software, which contained infringing content, into the UK.
  3. Copyright infringement – outputs produced by Stable Diffusion reproduced substantial parts of Getty Images’ copyrighted works. 
  4. There are further claims for infringement of database rights, trade marks and passing off. 

Stability AI’s defences include:

  1. That the alleged copyright infringement did not occur in the UK and therefore Stability AI has not breached UK law. Stability AI has argued that the training and development of Stable Diffusion took place outside the UK, and that the servers storing the training dataset are located in the United States. 
  2. Stable Diffusion does not memorise or reproduce individual images from the dataset it was trained on, as such it is not infringing copyrighted works. 
  3. Due to how the AI model generates outputs, Stable Diffusion produces varying images from the same or similar text prompts, meaning that no particular image can be reliably generated from any particular prompt.

The trial is set for 9 June 2025, and proceedings are also underway in the US. This case is one of the most high-profile legal battles over AI training data and copyright. A ruling in Getty Images’ favour could set a precedent for stricter copyright enforcement on AI-generated content; if Stability AI wins, it could open the door to broader use of copyrighted materials in AI training without explicit permission. The outcome could significantly influence copyright licensing and lead to changes in UK copyright law. If such reforms occur, they could substantially affect the UK’s appeal as a location for developing AI solutions.

Government response 

The government has recognised it needs to take proactive steps to improve protections for rightsholders, but also promote the UK as a top location for AI development. The recent Copyright and AI Consultation sought feedback on the government’s proposed changes to current law in order to address copyright and AI. The key proposed solutions are:

  • Increased transparency – AI developers would be required to disclose to rightsholders which datasets they have used and how those datasets were obtained, so that rightsholders can make an informed decision about whether to explicitly opt out of their works being used to train AI; and
  • Expanding the current exception to copyright law for text and data mining (which currently permits copying of copyrighted works only for non-commercial research purposes) to allow developers to use copyrighted materials for AI training, including for commercial purposes, unless creators have explicitly reserved their rights or opted out. If a creator has not reserved their rights or opted out, their works can be freely mined, scraped or used to train AI models. 

The last proposal may be the most controversial. Default opt-ins are already a major issue in the tech industry, with companies such as X and Meta automatically collecting their users’ data to train their AI systems without giving users a chance to opt out. Requiring rightsholders to opt their copyrighted works out of being used to train AI arguably places an unfair burden on them and puts their rights at risk of infringement by AI developers. 

What can be done now?

Whilst we wait for the government to decide how to proceed following the consultation, there are a number of things that can be done by parties to ensure a balance between AI innovation and copyright protection is achieved:

AI Developers – AI developers must take care to comply with current copyright law when using data for AI training purposes. They can do so by obtaining licences from, and/or compensating, the copyright owners of the materials they use. 

Users of AI – users of AI systems should exercise caution by questioning the source of the outputs produced by AI. Content creators must also take care when using AI to create original content, both to ensure they aren’t unknowingly using another’s original content and to preserve their own rights (see our article ‘AI writes a bestseller: But who gets the byline?’ for further details). Users of AI systems can protect themselves by negotiating contractual warranties confirming that the AI developer has the right to use the data on which the system has been trained, backed up by an indemnity in the event of breach.

Copyright owners – Unlike trade marks, there is no register of copyrighted works, so it can be difficult for owners to track and challenge possible copyright infringement. Until there is a system in place that allows copyright owners to explicitly reserve their rights, they need to ensure that any licensing of their works is properly documented, with clear obligations and restrictions on the licensee. 

A collective effort is required from the government, AI developers and rightsholders in order to successfully address these ongoing issues. Whilst AI is here to stay, it is crucial that we learn how to use it responsibly in order to properly protect the rights of creators without hindering innovation. This balance remains to be achieved.