DocLang aims to make documents readable by AI, not humans
AIs struggle to understand documents designed for humans; the DocLang working group seeks to flip that imbalance with its specification for machine-readable business documents “built from the ground up for LLM tokenizers.”
The working group, founded by IBM, Nvidia, and Red Hat and hosted by the Linux Foundation’s LF AI & Data project, aims to create an open, universal, AI-native document format designed to improve how enterprises prepare, exchange, and govern document data for AI systems. ABBYY and Human Signal will also be involved in its development, and other contributors are welcome.
“Enterprises today work across a fragmented landscape of document formats, including PDFs, JPEGs, and other file types built primarily for human consumption rather than AI interpretation,” the group said in its launch announcement.
“This disconnect can introduce complexity, raise costs, and reduce reliability when extracting meaning from business documents,” as organizations increasingly rely on generative AI and agentic systems, it said.
Mark Collier, executive director of LF AI & Data, said the goal of the DocLang Specification Working Group is to “develop a vendor-neutral, interoperable standard that helps organizations prepare document data for AI more reliably, transparently, and at scale.”
DocLang defines a structured, machine-readable format for documents of any type, much as JSON does for data, that any tool can implement and any pipeline can consume. It builds on Docling, a document processing toolkit hosted by LF AI & Data that can transform human-readable PDFs, word processor documents, or spreadsheets into structured data.
Standards must evolve for AI
Something like DocLang is needed, said independent technology analyst Carmi Levy. “Existing document standards have done an admirable job allowing global stakeholders to confidently collaborate for decades, but it’s becoming increasingly clear that they are in desperate need of an update as AI reshapes the rules around how work gets done,” he explained.
Largely static document types, he said, “can be somewhat limiting when AI is redefining the very word, ‘document.’ In many ways, AI-age documents are far more iterative and dynamic than what they once were, and the definitions need to evolve with the times. The documents we currently live with simply weren’t designed for the AI age.”
Within that context, Levy said, “DocLang represents an early, best hope of achieving some kind of foundational baseline for document standards, one that will hopefully allow more intelligent, more efficient, lower-risk workflows than is currently the case.”
Taking an open-source, vendor-agnostic approach to the process ensures that the needs of the collective take precedence over those of specific vendors, he said, adding, “earlier standards-setting efforts around networking, documentation, the web, and the cloud powered the free-flowing digital landscape that defines modern life.”
An AI-centric documentation standard will carry that reality into the next generation of technology, said Levy.
A question of governance
The entire concept of LLMs, Jason Andersen, principal analyst at Moor Insights & Strategy, said, “involves using natural human languages. The computer is supposed to understand us without us changing our syntax or language. Forcing a syntax on users is exactly what we have today with SEO and more advanced programming languages.”
With something like DocLang, where the standard can be applied to content ingestion, he said, “I would be OK with that being automated, which seems to be the intent. The use case I envision is that when I upload a document to an agent, a skill can be run to preprocess the document into the DocLang standard format, saving tokens.”
That makes sense, he said, adding that he thinks it’s good “if it can help generate outputs, like a visualization, that can be shared outside an AI tool. On that front, that is also why I am liking Web MCP, since you are just adding some code to the page, like CSS or JavaScript, and the consumer, in this case, an AI browser or skill, is better equipped to handle the site.”
The point, he said, is, “these standards need to preserve the fact that humans can still do what they want, and do not need to know any coding to be proficient. In terms of governance, I am not sure if it matters.”
But one analyst did foresee governance problems arising from DocLang’s use.
Yaz Palanichamy, senior research analyst at Info-Tech Research Group, said DocLang adoption will require organizations to implement and review controls in order to scale its use accountably and securely.
This article originally appeared on CIO.com.