Eleuther AI releases 8TB collection of licensed and open training data


AI research organization Eleuther AI has launched a massive text database, Common Pile v0.1, that can be used to train AI systems, according to TechCrunch. The 8TB database consists exclusively of openly licensed texts and texts classified as public domain.

Common Pile v0.1 was developed over two years in collaboration with Poolside, Hugging Face, the US Library of Congress and the University of Toronto, among others.

The data collection was released after concerns arose about several generative AI (genAI) companies using copyrighted material to train their models without the permission of the copyright owners. Eleuther AI was also behind The Pile, an earlier collection that has become a central point in that debate; with Common Pile v0.1, it now wants to show that training is possible without copyrighted material.

Common Pile v0.1 was reportedly used to train the Comma v0.1-1T and Comma v0.1-2T AI models; Eleuther AI claims Comma v0.1-2T performs as well as Meta’s first Llama model in terms of programming, image understanding and math. Eleuther AI plans to release more open data collections in the future.