It might be time for IT to consider AI models that don’t steal
With enterprises pouring billions of dollars into generative AI (genAI) initiatives, doubts about future legal exposures are typically ignored.
The risks are practically endless. Although enterprises usually do extensive fine-tuning on their own data before deploying large language models (LLMs), the massive underlying training dataset is a black box. The major model makers — including OpenAI, Google, AWS, Anthropic, Meta, and Microsoft — provide no visibility into their training data: how old or out-of-date it is, how reliable it is, what languages it came from, and, critically, whether it violates privacy rules, copyright restrictions, trademarks, patents, or rules governing regulated data (healthcare data, financial data, PII, payment card details, security credentials, etc.).
Even when vendors provide source lists for the data used to train their models, those lists may not include meaningful detail. A source might be listed simply as "Visa transaction information." How old is it? Is it verified? Has it been sufficiently sanitized for compliance?
Could the concept of the crime of receiving stolen property — where, in some US states, the defendant merely has to have reason to believe that the property may have been stolen — apply? Enterprises know that much of the training data in the LLMs they use violates copyright and other rules, but they treat it as a “don’t ask, don’t tell” situation. If this explodes in two years, corporations are going to have a hard time convincing a judge or jury that they didn’t have a good idea that the underlying data was stolen.
There are also the derivative risks. Let’s say that someone has figured out a new way of calculating geothermal energy output. What if an LLM maker scraped that information without permission and trained its model on it? And, say, ExxonMobil licensed that model and inadvertently used the information to create a more profitable means of extracting energy that generates $5 billion in fresh profits? Could the original inventor later sue ExxonMobil for a portion of those profits?
“This is something I have been thinking about intermittently for the past year or so,” said Jason Andersen, a VP/principal analyst for Moor Insights & Strategy. “This is going to be a big consideration in the future. As the costs of training and tuning open-source foundation models continue to drop, enterprises would be foolish to leave themselves exposed, especially in this high-stakes regulatory environment.”
Andersen said this issue is becoming more complicated due to the June court decision regarding Anthropic and fair use. The court in that case ruled that it was legal for Anthropic to train its AI models on published books without the authors’ permission — as long as the company had legally and properly purchased copies of those books.
“It’s a multilevel problem and, frankly, the Anthropic case sort of skirts the issue,” Andersen said. “My musing has been why more companies are not thinking or caring about this.”
Models that don’t steal data
One option that has many pros and cons is to use genAI models that explicitly avoid training on any information that is legally dicey. There are a handful of university-led initiatives that say they try to limit model training data to information that is legally in the clear, such as open source or public domain material.
Common Pile, for example, was created by researchers from the University of Toronto, Cornell, the University of Maryland, MIT, Carnegie Mellon, and Lawrence Livermore National Laboratory, along with AI vendors including Hugging Face and EleutherAI.
It is an 8TB collection of openly licensed text designed for training LLMs. It includes content from, according to the Common Pile website, “30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts.”
The Common Pile team says it “validates our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.”
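For teams that want to kick the tires, the Comma checkpoints are published on Hugging Face and load like any other causal language model. Below is a minimal sketch using the transformers library; the repo ID shown is an assumption, so check the Common Pile project page for the exact checkpoint name.

```python
# A minimal sketch of trying an openly licensed model such as Comma v0.1.
# The repo ID "common-pile/comma-v0.1-2t" is an assumption; confirm the
# actual checkpoint name on the Common Pile page before running this.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1-2t"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Openly licensed training data matters because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```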
Other efforts include Pleias, which offers ethically trained small language models, and Fairly Trained, a nonprofit organization that certifies genAI models that don’t use copyrighted materials without the owners’ consent.
There are some concerns that the early open models lag in performance when compared with the major commercial models. Common Pile’s creators, for instance, say it performs on par with Meta’s two-year-old Llama 2 model.
Dion Wiggins, the CTO at Omniscien Technologies in Singapore, has posted favorably on LinkedIn about the open data efforts, but he concedes that there is a meaningful performance lag — at least for now.
“Is it practical to replace the leading models of today right now? No. But that is not the point. This level of quality was built on just 32 ethical data sources. There are millions more that can be used,” Wiggins wrote in response to a reader’s comment on his post. “This is a baseline that proves that Big AI lied. Efforts are underway to add more data that will bring it up to more competitive levels. It is not there yet.”
Still, enterprises are investing in and planning for genAI deployments for the long term, and they may find in time that ethically sourced models deliver both safety and performance.
What about copyright indemnification?
Tipping the scales in the other direction are the big model makers' promises of indemnification. Some genAI vendors have said they will cover the legal costs for customers who are sued over content produced by their models.
“If the model provides indemnification, this is what enterprises should shoot for,” Moor’s Andersen said. “It says that the vendor assumes responsibility for training the model and intends not to utilize copyrighted data. However, it acknowledges that mistakes can occur, especially when examining a general-purpose model. If a copyright holder comes forward and makes a claim, the model vendor will take full responsibility.”
Not all vendors — or all products offered by the same vendor — promise the same level of protection, so it’s important to read the fine print. “IBM has the broadest indemnification, but many of the major vendors also offer some form of protection,” Andersen said. “For instance, it’s part of [Anthropic] Claude’s enterprise level tier, but not the cheaper ones — and Adobe and Getty get this automatically since they only use their [own] data [for model training].”
Others question how much protection such a policy would deliver, given the wide range of possible liability exposures.
“Software vendors are highly unlikely to [meaningfully] indemnify. What exactly are they indemnifying you against? You will be liable for what you generate and use via AI,” said Mark Rasch, a former federal prosecutor who specializes in technology legal issues. Today, he serves as a professorial lecturer in law at the George Washington University Law School and as legal counsel for Unit 221B, a data privacy and security compliance consulting firm.
“Could you impose on the model a requirement to make sure that it doesn’t infringe on anyone’s content?” asked Rasch, adding that it would likely prove difficult if not impossible to enforce. “The copyright infringement problem is very vague. Using AI means that you are venturing into unknown territory, from a practical and legal perspective.”
Weighing risks and benefits
Retail giant Macy’s uses LLMs from “the top five” model makers, said Brian Phillips, VP of Technology. Regarding the legal complexities of copyright and other issues when leveraging genAI, “it is absolutely a risk,” he admitted.
It’s a difficult balancing act, Phillips said. “We are both trying to be cutting edge” and taking steps to be “innovative with constraints.” In balance, though, he believes that using the major genAI models delivers benefits that “outweigh the risks.”
Phillips spoke of a “risk transference” from Macy’s to the genAI model makers. “We are expecting them to have a sanitized model,” he said, adding that the major corporate model makers Macy’s is working with have assured the retailer that they “will take on any liability.”
Andersen agreed. “I don’t think these smaller academic models offer anything magical just because they have clean data. That is not a big enough differentiator,” he said.
Andersen also stressed that the claims of using only clean data are just that: claims.
“The outside data coming in is somehow ‘certified’ as clean. There are many different models and datasets available today that can claim this on Hugging Face. Some licensing schemes, such as Creative Commons, also support this,” he said. “But we still need to take someone’s word for it, and models and datasets can be fluid, so how can an enterprise keep up? This is what we see Fairly Trained trying to do.”
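One lightweight check an enterprise can automate is reading the license metadata a publisher declares for a dataset on Hugging Face. The sketch below uses the huggingface_hub client; the dataset ID is illustrative, and, as Andersen notes, a declared license is still the publisher's word rather than a guarantee.

```python
# A minimal sketch of spot-checking a dataset's declared license on
# Hugging Face before pulling it into a training or fine-tuning pipeline.
# The dataset ID below is illustrative; substitute the dataset your team
# is actually evaluating.
from huggingface_hub import HfApi

api = HfApi()
info = api.dataset_info("common-pile/comma_v0.1_training_dataset")  # illustrative ID

# When the publisher supplies license information, it appears as a
# "license:..." tag on the repo. The absence of a tag is itself a signal.
license_tags = [t for t in (info.tags or []) if t.startswith("license:")]
print(license_tags or "No license tag declared; investigate before use")
```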
And there are still risks even with open data sources, Andersen added. “If you are using these models for discovery” and it yields a new invention or a new drug, “someone may [still] come around and say, ‘I deserve a royalty.’”