Copyright, Training Data, and the Cases Reshaping AI

The artificial intelligence revolution is running on a high-octane fuel: data. Trillions of data points—images, articles, songs, and books—have been fed into models to create the generative tools reshaping our world. But this digital feast has come with a massive legal hangover. A firestorm of high-profile lawsuits has erupted, pitting creators and media giants against the tech behemoths powering the AI boom. The central question is no longer just what AI can do, but what it is legally allowed to know. These legal battles are not mere squabbles over royalties; they represent a fundamental clash that will define the next decade of innovation, creativity, and intellectual property itself.

From courtrooms in San Francisco to legislative chambers in Brussels, the rules of the road for AI are being written in real time. The outcomes of these cases could either solidify the dominance of major AI labs or force them to rebuild their models from the ground up, potentially creating a multibillion-dollar licensing market. For businesses, creators, and consumers, the stakes are enormous. Navigating this landscape requires understanding the key legal precedents being set, from whether an AI can be an “author” to the fierce debate over whether training a model constitutes “fair use” or outright theft. The future of AI hinges not just on better algorithms, but on the legal frameworks that will govern them.


The human authorship hurdle: why AI can’t own a copyright (yet)

Before diving into the chaos of training data, a foundational question had to be answered: can a work generated entirely by AI be copyrighted? The U.S. Copyright Office and federal courts have delivered a clear, resounding “no.” The landmark 2023 district court ruling in *Thaler v. Perlmutter*, later affirmed on appeal by the D.C. Circuit in 2025, solidified the principle that copyright protection is fundamentally tied to human creativity. The court upheld the Copyright Office’s refusal to register a work “autonomously created by a computer algorithm,” affirming that human authorship is the non-negotiable bedrock of U.S. copyright law.

However, this seemingly simple rule opens a Pandora’s box of complexity. What happens in cases of human-AI collaboration? The Copyright Office has offered guidance, stating that a work incorporating AI-generated material may be copyrightable if a human has contributed sufficient creative input through selection, arrangement, or modification. This leaves a vast, murky gray area. The line between using AI as a sophisticated tool (like a camera or a word processor) and having it generate the core expressive elements of a work is becoming increasingly blurred. Future legal challenges will undoubtedly force courts to draw finer distinctions, defining exactly how much human touch is required to transform a machine’s output into a protected piece of intellectual property.

Training data on trial: inside the high-stakes lawsuits against AI giants

The main event in the AI copyright arena revolves around the data used to train large language models (LLMs) and image generators. Several blockbuster lawsuits have put the standard industry practice of scraping the web for training data under intense legal scrutiny. In *Andersen v. Stability AI*, a class-action suit brought by artists, the plaintiffs allege that image generators such as Stable Diffusion and Midjourney were built on the unlicensed use of billions of copyrighted images, creating what they call a “collage tool” that infringes on their work.

Perhaps the most-watched case is *The New York Times v. OpenAI and Microsoft*. The media giant sued the tech companies in late 2023, claiming that millions of its articles were unlawfully used to train the models behind ChatGPT and other services. The suit argues that these tools not only copied content but now directly compete with the newspaper by generating answers that reproduce its reporting verbatim, siphoning away readers. These cases are challenging the very foundation of the generative AI industry, asking whether the act of “training” is a transformative new use or simply digital piracy on an unprecedented scale.

The “fair use” doctrine: AI’s get-out-of-jail-free card?

At the heart of the defense mounted by AI companies is the doctrine of “fair use.” This legal principle allows for the limited use of copyrighted material without permission for purposes like criticism, commentary, or research. AI developers argue that training models is a transformative use, akin to a human reading books to learn, not to republish them. However, a pivotal February 2025 ruling in *Thomson Reuters v. Ross Intelligence Inc.* has cast serious doubt on this defense, at least for certain types of AI.

In that case, the court found that Ross, an AI-powered legal research company, infringed on Thomson Reuters’s copyrighted content by using it to train a competing product. The judge’s analysis of the four fair use factors was particularly revealing. The court ruled that Ross’s use was commercial, not transformative, and directly harmed the market for the original work. While the judge carefully noted the ruling was limited to the non-generative AI at issue, the decision provides a potential roadmap for how other courts might view these disputes.

Breaking down the four factors in the AI context

The fair use analysis hinges on a delicate balance of four key factors, each now being reinterpreted for the age of AI. Understanding them is critical to grasping the core legal conflict.

Purpose and Character of the Use: Was the new work transformative, adding a new expression or meaning, or was it merely a substitute for the original? AI companies claim training is transformative, while plaintiffs argue the output often directly competes with the source material.
Nature of the Copyrighted Work: Use of factual works is more likely to be considered fair than the use of highly creative or artistic works. This could lead to different outcomes for models trained on news articles versus those trained on novels or fine art.
Amount and Substantiality of the Portion Used: How much of the original work was copied? AI models are trained on entire works, but defendants argue that no single work is stored or reproduced in its entirety in the final output.
Effect on the Potential Market: This is often considered the most important factor. Does the new work harm the original’s market value? The New York Times lawsuit hinges on this, arguing that AI-generated summaries directly cannibalize its subscription and ad revenue.

The ongoing legal battles over AI training data are complex, and the central conflict often comes down to a simple question: is it training data or taking data?

A global tug-of-war: how the US, EU, and China are shaping AI copyright rules

The fight over AI and copyright isn’t confined to the United States. A patchwork of global regulations is emerging, creating a complex compliance challenge for tech companies. The European Union has taken a proactive, regulatory-heavy approach with its landmark AI Act. This comprehensive framework introduces a risk-based system and, crucially for copyright holders, imposes transparency obligations on providers of general-purpose AI models, including generative AI. These providers must document their training data and publish a sufficiently detailed summary of the content used for training, a measure intended to help rights holders enforce copyright law.

Meanwhile, China has taken a surprisingly progressive, albeit state-controlled, stance. In a 2023 ruling, the Beijing Internet Court granted copyright protection to an AI-generated image, reasoning that it reflected the user’s intellectual investment and aesthetic choices in crafting the prompts. This contrasts sharply with the U.S. position. However, China also holds AI companies responsible for unlawful content produced by their models and requires AI-generated content to be clearly labeled. This global divergence means the future of AI development will be shaped not by one legal standard, but by a geopolitical contest of competing legal philosophies.


Can I copyright something I made using an AI tool?

It depends on the level of human involvement. If you simply used a text prompt to generate an image or text and made no further creative changes, it is unlikely to be protected by copyright under current U.S. law. However, if you significantly modify, arrange, or combine AI-generated elements in a creative way, the resulting work may be eligible for copyright protection based on your human authorship.

Is it illegal for companies to use copyrighted works to train AI models?

This is the central question in many ongoing lawsuits, and the answer is not yet settled. AI companies argue it falls under the ‘fair use’ doctrine, while copyright holders call it infringement. In *Thomson Reuters v. Ross Intelligence Inc.*, a case involving a non-generative AI company, the court found that using copyrighted material to train a competing product was not fair use. The outcomes of major cases against generative AI developers will set a clearer precedent.

What is the Generative AI Copyright Disclosure Act?

This is a proposed bill in the U.S. Congress introduced in 2024. If passed, it would require AI developers to submit a public notice to the Copyright Office detailing all copyrighted works used in their training datasets. The goal is to increase transparency and allow creators to know if their work was used to train an AI model.

How is the EU’s approach to AI copyright different from the US?

The EU is taking a more comprehensive regulatory approach with its AI Act. It will require developers of general-purpose AI models to maintain detailed documentation of their training data and provide summaries to the public. This is a proactive regulatory approach, whereas the U.S. is currently relying on courts to interpret existing copyright law on a case-by-case basis.