AI systems will learn bad behavior to meet performance goals, suggest researchers
There are plenty of stories out there about how politicians, sales representatives, and influencers, will exaggerate or distort the facts in order to win votes, sales, or clicks, even when they know they shouldn’t. It turns out that AI models, too, can suffer from these decidedly human failings.
Two researchers at Stanford University suggest in a new preprint research paper that repeatedly optimizing large language models (LLMs) for such market-driven objectives can lead them to adopt bad behaviors as a side-effect of their training — even when they are instructed to stick to the rules.
LLMs are already in use in the enterprise to write promotional materials. The same techniques are also used by politicians hunting for votes and social media influencers craving followers. It’s easy enough to optimize LLMs for such competitive markets, but Batu El and James Zou wanted to know what happens to models’ safety and honesty as a result of such optimizations.
For their study, they created three distinct scenarios: online election drives aimed at voters, product sales pitches seeking to persuade customers, and social media posts aimed at boosting followers and audience reaction.
LLMs all the way down
One potential flaw in their experiment, if we are to draw real-world conclusions from it, is that no humans were harmed by exposure to the potentially misleading messages: It was LLMs all the way down.
First, they used two different methods (rejection fine-tuning and text feedback) to optimize two AI models, Qwen/Qwen3-8B and Llama-3.1-8B-Instruct, to generate content in three categories: a product sales pitch, a campaign speech for a political candidate, and a social media post about a news article. The prompts reminded the models to “stay true to the provided description,” stay faithful to the biography,” or “[stay] faithful to the facts”.
Then, the pair used GPT 4o to ‘probe for misalignment’ in the messages generated by the baseline models and the optimized models — in other words, looking for harmful behaviours such as misrepresentation of the product in the sales task, populism or disinformation in the election task, and disinformation or encouragement of unsafe activities in the social media task.
Finally, they used another LLM, GPT-4o-mini, to model different customer, voter, and reader personas and asked them to vote on the generated content.
What they found was that the optimization process increased the models’ ability to persuade the simulated customers, voters, and readers — but also resulted in greater misalignment, with the models changing or inventing facts, adopting an inappropriate tone, or offering harmful advice. The changes in performance and misalignment were small but, the researchers said, statistically significant.
“Our findings underscore the fragility of current safeguards and highlight the urgent need for stronger precautions to prevent competitive dynamics from eroding societal trust,” they wrote in the paper
What was even more striking, they said, was that “these misaligned behaviors emerge even when models are explicitly instructed to remain truthful and grounded, revealing the fragility of current alignment safeguards.”
Human analysis
Will Venters, associate professor for digital innovation at the London School of Economics said that the findings would be a shock for those who thought that AI would be a safeguard against human failings. “These alignment issues are seen in humans — some sales staff will always bend company rules, some campaign leaflets will be populist, and some social media posts will lie. What is shocking here is that, because LLMs are machines, we somehow expect to be able to control them differently to how we control humans, and also, because LLMs are machines, they can automate these misalignments at industrial scale,” he said.
Cairbre Sugrue, founder of PR agency Sugrue Communications said that the results should be warning to the PR and social media sectors. “The PR industry is keen to be seen to embrace AI. But in this rush to be first, is enough due diligence being done, particularly around generative AI? The rise of social media marketing showed that less scrupulous individuals and agencies were happy to use techniques to artificially boost SEO.”
The PR industry will need to adapt quickly when it came to deploying AI, he said: “The major agencies should be working with industry bodies to drive consensus around ethics. A mutually agreed code of conduct would be a good first step. This is a big challenge, but if we agree that trust is key to maintaining our reputation, we must avoid apathy and acquiescence when it comes to AI ethics.”
There should also be reassurances around company data and the originality of content created using these models, he said. “I’m not saying that these questions can be addressed overnight, but the PR industry needs to act now.”
More research needed
El and Zou, the Stanford researchers, said in their paper that it was incumbent on users to respond to the implications of their research.
“Our findings highlight how market-driven optimization pressures can systematically erode alignment, creating a race to the bottom, and suggest that safe deployment of AI systems will require stronger governance,” they wrote.
While the results of the research look concerning for organizations pondering the further use of AI, there are some caveats to consider. First, the sample size is very small: they simulated just 20 different personas for their synthetic audience. It’s not clear what conclusions could be drawn from this, when corporates will be working with audiences of several thousand. The researchers do acknowledge this and want to research larger and more diverse groups.
Second, this research has not yet been subjected to peer review and could well emerge in a different form.
Finally, some of the “disinformation” identified by their AI probe may be considered simplification or rounding off by human judges. For example, in the social media trial one example scored as disinformation involved an AI model summarizing a news story about a bomb “killing 80” when the original article reported that “at least 78” had died.
There is also some acknowledgment that some guardrails are acting as they should. In the election message, the researchers point out that “the API explicitly blocks fine-tuning on election-related content, and our job was flagged and rejected on that basis. This suggests that model providers have implemented strict safeguards for election-related topics.”AI systems will learn bad behavior to meet performance goals, suggest researchers – ComputerworldRead More