OpenAI prompts AI models to ‘confess’ when they cheat


OpenAI’s research team has trained its GPT-5 large language model to “confess” when it doesn’t follow instructions, providing a second output after its main answer that reports when the model didn’t do as it was told, cut corners, hallucinated, or was uncertain of its answer.

“If we can surface when that happens, we can better monitor deployed systems, improve training, and increase trust in the outputs,” OpenAI said in a statement.

The confession mechanism addresses a known problem in reinforcement learning: models optimized for several goals at once (correctness, helpfulness, safety, and user preferences) can learn to cut corners on one objective while appearing to satisfy the others.

OpenAI trained a version of GPT-5 Thinking to produce the confessions and tested the technique on stress-test datasets designed to elicit problematic behaviors including hallucinations, reward hacking, and instruction violations. It described the work as a proof of concept rather than a production-ready feature.

How the confession mechanism works

The confession reports include three elements: a list of explicit and implicit instructions the answer should satisfy, an analysis of whether the answer met those objectives, and a list of uncertainties or judgment calls the model encountered. The system evaluates confessions on honesty alone, separate from the main answer’s performance metrics.
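In practical terms, a confession could be represented as structured data returned alongside the main answer. The sketch below is a minimal Python illustration of the three elements described above; the class and field names are assumptions chosen for clarity, not a schema OpenAI has published.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical shape of a confession report, mirroring the three elements
# described in the article. Field names are illustrative assumptions,
# not a published OpenAI format.
@dataclass
class ConfessionReport:
    # Explicit and implicit instructions the answer was expected to satisfy
    instructions: List[str] = field(default_factory=list)
    # Analysis of whether the answer actually met those objectives
    objective_analysis: str = ""
    # Uncertainties or judgment calls the model encountered along the way
    uncertainties: List[str] = field(default_factory=list)
```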

“If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it,” OpenAI said. It compared this to the Catholic Church’s seal of confession: “Nothing the model reveals in the confession can change the reward it receives for completing its original task,” the researchers wrote in the technical paper.
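The reward separation can be pictured as two independent grading passes: one scores the main answer on the original task, the other scores the confession on honesty alone. The Python sketch below illustrates that idea; the function and grader names are hypothetical, and this is not OpenAI's training code.

```python
def total_reward(answer, confession, task_grader, honesty_grader):
    """Illustrative sketch of the 'seal of confession' reward split.

    The main answer is graded only on the original task; the confession is
    graded only on honesty. An honest admission of a violation raises the
    honesty reward and never changes the task reward.
    """
    task_reward = task_grader(answer)            # unaffected by the confession
    honesty_reward = honesty_grader(confession)  # rewards truthful admissions
    return task_reward + honesty_reward
```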

OpenAI tested confessions on multiple question-answering datasets, including TruthfulQA and medical QA benchmarks. In experiments where models were trained to exploit weak reward signals, confessions became more honest over time even as the models’ main outputs became more deceptive.

Enterprise applications for error reduction

The confession system targets use cases where unknowingly providing incorrect responses carries risks, including medical diagnosis support, legal research, and financial analysis.

Manufacturing and healthcare organizations deploying language models face liability when AI systems generate inaccurate specifications or treatment recommendations. A confession-enabled model could refuse to answer instead of providing unreliable information in these scenarios.

For these workflows, knowing when an AI’s answer should be set aside rather than acted on is itself valuable information.

Gartner principal analyst Rekha Kaushik noted that many organizations prioritize accuracy over completeness in AI-driven decision support, especially in government, healthcare, and finance sectors. “Workflows involving compliance checks, legal document review, or customer support for sensitive topics benefit most from this approach, where ‘no answer’ is safer than a potentially misleading one,” she said.

The OpenAI research team tested confessions across domains including science, history, and current events. Science questions triggered confessions at higher rates than factual recall questions, indicating the system detects domain-specific uncertainty patterns.

Integration with existing safety measures

OpenAI positions confessions as complementary to techniques such as retrieval-augmented generation (RAG) and Constitutional AI. Organizations can combine confessions with external knowledge bases, using the uncertainty signal to trigger document retrieval or human review.

“These should be used by organizations within their AI governance frameworks, using uncertainty flags to trigger human review or external knowledge base lookups,” Kaushik said. “By combining confessions with retrieval-augmented generation and other safety measures, organizations can create robust escalation paths, automatically routing flagged responses to experts or trusted data sources.”
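One way to picture such an escalation path is a routing step that reads the confession before the answer is released. The Python sketch below is an illustrative assumption of how that integration might look; the confession fields and helper functions (retrieve_docs, escalate_to_human) are hypothetical, not part of any shipped API.

```python
def route_response(answer, confession, retrieve_docs, escalate_to_human):
    """Sketch of a confession-driven escalation path (assumed integration).

    Instruction violations go to human review; flagged uncertainties trigger
    retrieval from a trusted knowledge base; clean confessions pass through.
    """
    if confession.get("instruction_violations"):
        # Flagged violation: route the response to an expert for review
        return escalate_to_human(answer, confession)
    if confession.get("uncertainties"):
        # Uncertainty reported: ground the answer in external documents
        docs = retrieve_docs(confession["uncertainties"])
        return {"answer": answer, "supporting_docs": docs, "needs_review": True}
    return {"answer": answer, "needs_review": False}
```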

The method works across model sizes and architectures without requiring changes to training procedures, OpenAI said.

Kaushik said that confession signals can empower enterprises to operationalize AI safety, turning uncertainty into actionable governance. “IT leaders can build trust and reduce liability, positioning AI as a responsible partner rather than a risk factor,” she said.

OpenAI said it plans to integrate confession mechanisms into future API releases, though the company hasn’t announced a specific timeline or availability details.