Proposed $1.5 billion Anthropic copyright settlement raises questions about generative AI costs
Anthropic has agreed to pay at least $1.5 billion to rights holders to settle a lawsuit over its training of generative AI models on copyrighted material used without permission, raising concerns that the deal could increase the licensing costs enterprises pay for AI models.
The class action stems from authors' claims, filed in August 2024, that "Anthropic downloaded known pirated versions of Plaintiffs' works, made copies of them, and fed these pirated copies into its models."
In papers filed with the court Friday, the plaintiffs' attorneys described the agreement as "the largest known copyright settlement in American history." If the judge in the case approves the settlement, the $1.5 billion fund will provide about $3,000 per copyrighted work at issue, which works out to roughly 500,000 works in all.
That approval, though, is by no means guaranteed.
In a Sunday filing scheduling a hearing for Monday, the judge wrote that he was “disappointed that counsel have left important questions to be answered in the future, including respecting the Works List, the Class List, the Claim Form, and, particularly for works with multiple claimants, the processes for notification (for opt-out, so-called re-inclusion, and claims, whether a given choice is exercised by one, some, or all coclaimants), allocation, and dispute resolution.”
Those elements will need to be worked out by a court working group, with any challenges from Anthropic resolved, well before the proposed deadline of Oct. 10 if the court is to grant preliminary approval on that date, the judge wrote.
Anthropic’s deputy general counsel, Aparna Sridhar, said, “[The] settlement, if approved, will resolve the plaintiffs’ remaining legacy claims. We remain committed to developing safe AI systems that help people and organizations extend their capabilities, advance scientific discovery, and solve complex problems.”
The proposed settlement requires Anthropic to delete the copies of pirated books it downloaded, and does not include claims based on the output of its AI models. An earlier court ruling accepted Anthropic’s arguments that training its models on digital copies of physical books it had legally purchased and scanned was fair use.
The exclusion of models’ output from the settlement worries Zachary Lewis, CIO at the 160-year-old University of Health Sciences and Pharmacy in St. Louis.
“This settlement excludes claims based on AI output, which for now is my biggest concern. If output ever comes into play, the risk will likely become too high to use Gen AI without some guarantees around training data,” Lewis said.
Fear of price hikes for generative AI
Any resulting increase in AI costs could add to existing fears that the AI bubble will burst.
“Customers may see some of the settlement costs passed on to them, or the costs to legally purchase works will increase costs,” said Lewis.
The proposed settlement is likely to move the AI industry into a more predictable space, said Barry Scannell of Irish law firm William Fry, with the figure of $3,000 per book becoming a de facto licensing benchmark. “Industry executives say it transforms the debate from abstract fair use arguments to hard cash and signals that AI companies must move rapidly from ‘grab now, defend later’ to structured licensing deals,” he said. “Expect to see catalogue licenses with per-work pricing and provenance warranties in every supplier contract. Stock photo agencies, music publishers, and news organizations will arrive at the table emboldened. The economic logic has flipped: paying upfront is cheaper than fighting in court.”
Another IT leader fearing a post-settlement price hike is Kevin Hall, CIO of the 129,000-member Westconsin Credit Union.
“Legally sourcing this content will cost considerably more money than simply pointing them at a pirating website to ingest the content,” he said, adding that correctly compensating content creators is fair, “but fair or not, it increases the costs for all parties.”
Some see Anthropic as winning the settlement
Anthropic and the AI industry as a whole will do well from the settlement, according to Hall.
“This precedent seems huge for AI that as long as content is legally sourced, it can be legally ingested into AI. That was a huge potential barrier to AI that opens all sorts of doors for AI creators to train their models,” he said.
He’s not the only one to see Anthropic as a winner here.
Jason Andersen, principal analyst for Moor Insights & Strategy, said, “A payment to impacted creators is welcome news. However, this decision effectively reinforces previous decisions on fair use that creators have been criticizing. This settlement was not about fair use at all. It’s specific to the fact that Anthropic knowingly downloaded content that was pirated, so it was therefore not fair use at all.”
As for the proposed obligation on Anthropic to delete its copies of pirated books, Andersen said, “I am unsure whether deleting the files after they were used to train a model will have a significant impact on current or future revisions of the Anthropic models. Of course you want to stop that material from being used again, but will that really matter to Anthropic in the long run?”
Andersen illustrated the settlement's potential problems with a hypothetical.
“What if we approached this case from a different perspective? Suppose a group of hackers stole customer information from a big company and posted it online or on the dark web, and an AI company trained its models on this ill-gotten data set,” Andersen said. “Would $3,000 be a fair penalty for the misuse of that data, given that this is the precedent set by this settlement?”
Transparency in training
One of the key issues behind this case is that many model makers provide insufficient information to their enterprise licensees about the nature of the data used to train those models. And the few model makers who provide training data visibility, almost all of whom are affiliated with universities, are getting little support from enterprises for their attempts to adhere strictly to copyright and related laws.
Visibility into training data is critical as it provides indications of model reliability. Is the data high-quality, meaning from highly qualified and reliable sources? Is it current or outdated? Is there sufficient data to handle the topics, geographies, or industries that the enterprise cares about? How much of the data is from the languages of greatest interest?
This creates an apparent corporate disconnect. Although IT executives are begging model makers to be transparent about their training data, many general counsels are happy with the secrecy for legal reasons. If the enterprise doesn't know what any of the training data is, the legal argument goes, it can't be accused of knowingly using stolen data.