Generative AI Is Challenging a 234-Year-Old Law

The technology might finally bend copyright past the breaking point, upending what it means to have a creative society in the process.

It took Ralph Ellison seven years to write Invisible Man. It took J. D. Salinger about 10 years to write The Catcher in the Rye. J. K. Rowling spent at least five years on the first Harry Potter book. Writing with the hope of publishing is always a leap of faith. Will you finish the project? Will it find an audience?

Whether authors realize it or not, the gamble is justified to a great extent by copyright. Who would spend all that time and emotional energy writing a book if anyone could rip the thing off without consequence? This is the sentiment behind at least nine recent copyright-infringement lawsuits against companies that are using tens of thousands of copyrighted books—at least—to train generative-AI systems. One of the suits alleges “systematic theft on a mass scale,” and AI companies are potentially liable for hundreds of millions of dollars, if not more.

In response, companies such as OpenAI and Meta have argued that their language models “learn” from books and produce “transformative” original work, just like humans. Therefore, they claim, no copies are being made, and the training is legal. “Use of texts to train LLaMA to statistically model language and generate original expression is transformative by nature and quintessential fair use,” Meta said in a court filing responding to one of the lawsuits last fall, referring to its generative-AI model.

Yet as the artist Karla Ortiz told a Senate subcommittee last year, AI companies use others’ work “without consent, credit, or compensation” to build products worth billions of dollars. For many writers and artists, the stakes are existential: Machines threaten to replace them with cheap synthetic output, offering prose and illustrations on command.

In bringing these lawsuits, writers are asserting that copyright should stop AI companies from continuing down this path. The cases get at the heart of the role of generative AI in society: Is AI, on balance, contributing enough to make up for what it takes? Since 1790, copyright law has fostered a thriving creative culture. Can it last?


Contrary to popular belief, copyright does not exist for the benefit of creators. Its purpose, according to founding documents and recent interpretations, is to foster a culture that produces great works of science, art, literature, and music. It happens to do this by granting people who produce those works a large degree of control over their reproduction and distribution, providing a financial incentive to do the work.

That concern for the public benefit is why the present-day law also allows for certain “fair uses” of copyrighted works. Printing short quotations from books or displaying thumbnails of photographs in search results are considered fair use, for example, as are parodies that use a story’s plot and characters. (Remember Spaceballs?) AI companies claim that training large language models on copyrighted books is also fair use because LLMs don’t reproduce the full text of books they’re trained on, and because they transform those books into a new kind of product.

[Read: There was never such a thing as “open” AI]

These claims are now being tested. “Is it in the public benefit to allow AI to be trained with copyrighted material?” asked Judge Stephanos Bibas a few months ago in an opinion on Thomson Reuters v. Ross Intelligence, which concerns the use of legal documents to train an AI research tool. Bibas noted that each side has its own ideas about what benefits the public. Tech companies argue that an AI product will make knowledge more accessible, while plaintiffs argue that AI will reduce incentives for sharing that knowledge in the first place, because that knowledge is typically stripped of authorship and presented as the creation of AI. Some writers have already stopped sharing their work online, and the courts will have to take seriously the idea that current AI-training practices could have a chilling effect on human creativity.

The fundamental question, as posed by copyright law, is whether a generative-AI product provides a net public benefit. Any product “that substantially undermine[s] copyright incentives” could fail to qualify as fair use, the legal scholar Matthew Sag wrote in testimony to the Senate last year. If people habitually ask ChatGPT about books and articles instead of reading them, the audience for books and articles (and the incentive for writing them) will shrink. Current AI tools that present human-created knowledge without citation are already preventing readers (and other writers) from connecting with people who share their interests, jeopardizing the health of research communities, and harming incentives for cultivating and sharing expertise. All of this could lead to a culture in which the circulation of knowledge is impeded rather than encouraged, and in which a future Salinger decides it isn’t worth writing The Catcher in the Rye. (In response to such concerns, OpenAI insisted this week, in a motion to dismiss The New York Times’ lawsuit against the company, that ChatGPT is a tool for “efficiency” and “not in any way a substitute” for a subscription to the paper.)

Tech companies and AI enthusiasts have argued that if a human needs no special license to read a book in a library, then neither should AI. But as the legal scholar James Grimmelmann has observed, just because it’s fair for a person to do something for self-education doesn’t necessarily mean it’s fair for a corporation to do it on a massive scale for profit.

[Read: What I found in a dataset Meta uses to train AI]

As for the argument that AI training is fair use because it “transforms” the authors’ original works, the case usually cited as precedent is Authors Guild v. Google, in which the Authors Guild sued Google for scanning millions of books to create the research product known as Google Books. In that case, a judge ruled that the scanning was fair use because Google Books primarily functioned as a research tool, with strict limits on how much copyrighted text it revealed, and its purpose (providing insights into the book collection as a whole, as it does through its Ngram Viewer) was very different from the purpose of the books used to build it (which are meant to be read).  

But generative-AI products such as ChatGPT and DALL-E do not always serve different purposes from the books and artworks they’re trained on. AI-generated images and text can substitute for the purchase of a book or the commission of an illustration. And although an LLM’s output is usually different from the text it was trained on, it isn’t always.

The recent Times lawsuit and another filed by Universal Music Group show that LLMs sometimes reproduce their training text. According to UMG, Claude, an LLM made by Anthropic, can reproduce the lyrics to entire songs, nearly verbatim, and present them to users as original creations. The Times showed that ChatGPT can reproduce large chunks of Times articles. This behavior is called “memorization.” It’s currently difficult, if not impossible, to eliminate. It can be concealed, to a degree, but because of generative AI’s complexity and unpredictability (it’s often called a “black box”), its makers can’t give any guarantees about how often, and under what circumstances, an LLM quotes its training data. Imagine a student or a journalist who won’t promise not to plagiarize. That’s the kind of ethically compromised position these products occupy.


Courts have occasionally struggled with newfangled technologies. Consider the player piano, a machine that takes rolls of paper as input. The rolls are a form of sheet music in which notes are represented by punched holes rather than written symbols. A publisher of sheet music once sued a piano-roll manufacturer, claiming it was making and selling illegal copies. The case went all the way to the Supreme Court, which in 1908 decided the rolls weren’t copies, just part of the player piano’s “machinery.” In hindsight, the decision makes little sense. It’s like saying a DVD isn’t a copy of a film because it’s a digital rather than an analog encoding of the sound and images.

The decision was rendered moot by the Copyright Act of 1909, which said that makers of piano rolls did owe royalties. But, as Grimmelmann told me, arcane technologies of reproduction can be deceptive. They can seem magical or incomprehensible, disconnected from the intellectual property that powers them.

[Read: Welcome to a world without endings]

Some people wonder whether copyright law, fundamentally unchanged since the late 1700s, can handle generative AI. Its basic unit is the “copy,” a concept that’s felt like a poor fit for modernity since the launch of music and video streaming in the 1990s. Might generative AI finally bend copyright past the breaking point? I spoke about this with William Patry, a former senior official at the U.S. Copyright Office, whose treatises on copyright are among the most frequently cited by federal courts, and who was also a senior copyright counsel at Google during the Authors Guild litigation. “I wrote laws for a living for seven years,” he told me. “It’s not easy.” New technologies frequently come along that threaten to upend legal regimes and social norms, he said, but the law can’t constantly change to accommodate them.

The language of copyright can seem frustratingly out of date, but good laws probably need to be this way. They have to be stable, so we know what to expect, and also dynamic, “in the sense of having play in the joints,” said Patry, the author of a book called How to Fix Copyright. He’s critical of certain aspects of the law, but he doubts AI will be the technology to finally break it.

Instead, he said, judges may be cautious with their decisions. A blanket ruling about AI training is unlikely. Rather than declaring that “AI training is fair use,” judges might decide that it’s fair to train certain AI products but not others, depending on what features a product has or how often it quotes from its training data. We could also end up with different rules for commercial and noncommercial AI systems. Grimmelmann told me that judges might even consider tangential factors, such as whether a defendant has been developing its AI products responsibly or recklessly. In any case, judges face difficult decisions. As Bibas admitted, “Deciding whether the public’s interest is better served by protecting a creator or a copier is perilous, and an uncomfortable position for a court.”

Generative AI could not compensate for the loss of novels, investigative journalism, and deeply researched nonfiction. As a technology that makes statistical predictions based on the data it has encountered, it can produce only knockoffs of what’s come before. As the dominant mode of creativity, it would stop a culture in its tracks. If human authors aren’t motivated to write and publish works that move us, help us empathize, and take us to imaginary places that shift our perspective and help us see the world more clearly, we will simply have a culture without those things. Generative AI might offer synthetic reminders of what came before, but can it help us build for the future?
