Bias alert: LLMs suggest women seek lower salaries than men in job negotiations


That helpful chatbot you’re consulting for salary advice could be leading you astray — especially if you’re a woman or a member of a minority group.

Researchers led by Ivan P. Yamshchikov, a professor at the Technical University of Applied Sciences Würzburg-Schweinfurt (THWS), compared the answers LLMs gave to salary questions posed by a range of personas; in many cases, the advice varied widely depending on who the AI tool thought it was advising.

In their paper, Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models, the researchers reported on several experiments, one of which examined the advice various large language models (LLMs) give job applicants about the initial salaries they should request in negotiations.

As part of the research, the team created a series of personas of various genders, ethnicities, and seniority levels to examine whether the same question, posed by different personas, would draw the same response from an LLM. Specifically, one experiment asked several LLMs (GPT-4o mini, Claude 3.5 Haiku, Qwen 2.5 Plus, Mixtral 8x22B, and Llama 3.1 8B) what starting salary each “person” should request when opening job negotiations.
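The setup can be pictured with a minimal sketch (not the authors’ code): build a persona-conditioned prompt, send it to one of the tested models, and pull the dollar figure out of the reply. The prompt wording, the persona fields, and the extract_salary helper below are assumptions made for the example, which queries only GPT-4o mini via the official openai Python client.

```python
# Minimal sketch of a persona-conditioned salary-advice query.
# Illustrative only -- not the authors' code or exact prompts.
import re
from itertools import product
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GENDERS = ["male", "female"]
ETHNICITIES = ["Asian", "Black", "Hispanic", "white"]
SENIORITY = ["entry-level", "experienced"]

def build_prompt(gender: str, ethnicity: str, seniority: str) -> str:
    # The persona is stated explicitly in the prompt; the wording is assumed.
    return (
        f"I am a {gender} {ethnicity} {seniority} software engineer in Denver, CO. "
        "What starting salary should I ask for in my negotiation? "
        "Answer with a single dollar figure."
    )

def extract_salary(text: str) -> float | None:
    # Pull the first dollar amount out of the model's reply.
    match = re.search(r"\$([\d,]+)", text)
    return float(match.group(1).replace(",", "")) if match else None

results = {}
for gender, ethnicity, seniority in product(GENDERS, ETHNICITIES, SENIORITY):
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(gender, ethnicity, seniority)}],
    )
    results[(gender, ethnicity, seniority)] = extract_salary(reply.choices[0].message.content)

# Comparing the collected figures across personas exposes any systematic gaps.
```

Repeating each query many times per persona, rather than once, is what allows the resulting differences to be tested statistically.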

Interestingly, white males were not necessarily advised to start off negotiations with top-dollar requests, Yamshchikov said. “Models tend to recommend higher salaries to the user if they say they are Asian,” he said via email. 

But in most instances, women were advised to ask for lower base salaries than their male counterparts.

The difference in recommendations for initial salary requests can be dramatic. In one example, an experienced medical specialist negotiating a base salary in Denver, CO, was advised by ChatGPT o3 to ask for $400,000 if they were male. An equally qualified woman asking about the same position was told to ask for $280,000.

The bias tends to compound

The researchers examined factors other than gender, comparing the salaries recommended for Asian, Black, Hispanic, and white individuals, as well as for expatriates, migrants, and refugees. Not only did the suggested salary requests vary, but more than half of the tested persona combinations also showed at least one statistically significant deviation across the models.

That means the results were probably not accidental.
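Whether a gap like that is statistically meaningful can be checked with a standard significance test. The snippet below is a toy illustration using a non-parametric Mann-Whitney U test on made-up numbers; it is not the paper’s own test or data, just a sketch of the kind of comparison involved.

```python
# Toy significance check on two sets of salary recommendations.
# The figures are invented; the paper's own tests and data may differ.
from scipy.stats import mannwhitneyu

male_advice = [400_000, 390_000, 410_000, 395_000, 405_000]
female_advice = [280_000, 300_000, 290_000, 310_000, 285_000]

stat, p_value = mannwhitneyu(male_advice, female_advice, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")  # a small p-value means the gap is unlikely to be chance
```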

And when the researchers created compound personas combining the traits that drew the highest and lowest average salary advice, and re-ran their tests for a “male Asian expatriate” and a “female Hispanic refugee,” they reported that “35 out of 40 experiments (87.5%) show significant dominance of ‘Male Asian expatriate’ over ‘Female Hispanic refugee.’”

“Our results align with prior findings [which] observed that even subtle signals like candidates’ first names can trigger gender and racial disparities in employment-related prompts,” the researchers wrote. And “when we ground the experiments in the socioeconomic context, in particular, the financial one, the biases become more pronounced. When we combine the personae into compound ones based on the largest and lowest average salary advice, the bias tends to compound.”

Bias in, bias out?

The bias could reflect major issues during LLM development, the researchers said, because while the probability is low that someone would mention all of the characteristics in a single query, the chatbot might remember information from previous interactions, meaning “this bias becomes inherent in the communication.

“Therefore,” they wrote, “with the modern features of LLMs, there is no need to pre-prompt personae to get the biased answer: all the necessary information is highly likely already collected by an LLM.”
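That point is easy to picture: once a gender, ethnicity, or migration status has been mentioned anywhere in a conversation (or stored by a memory feature), it sits in the context the model sees when a later, apparently neutral question arrives. The hypothetical exchange below illustrates this with the same openai client as above; the wording and model choice are placeholders, not examples taken from the paper.

```python
# Sketch of how persona details from an earlier turn can color later advice.
# Hypothetical conversation; not an example from the paper.
from openai import OpenAI

client = OpenAI()

history = [
    {"role": "user", "content": "Hi! I recently moved here as a refugee and I'm settling in."},
    {"role": "assistant", "content": "Welcome! How can I help?"},
    # The salary question itself never mentions migration status,
    # but the model still sees it in the conversation history.
    {"role": "user", "content": "What starting salary should I ask for as a nurse in Denver?"},
]

reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
print(reply.choices[0].message.content)
```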

“All those biases are directly connected with the bias in the training data,” Yamshchikov explained by email. “For example, salary advice for a user who identifies as an ‘expatriate’ would generally be higher than for a user who calls themselves a ‘migrant.’ This is purely due to how often these two words are used in the training dataset and the contexts in which they occur.

“This example also illustrates that ‘de-biasing’ large language models is a huge task. Currently, it’s an iterative process of trial and error, so we hope that our observations could help model developers ship the next generation of models that would do better.”

Yamshchikov and his team developed the study as part of AIOLIA, a project in which his research group works on the use of LLMs as personal assistants. AIOLIA, an EU-funded project with participation from universities and institutions in 15 countries, aims to develop and implement ethical guidelines for the use of AI in everyday life.

Yamshchikov’s research group is working within its framework to make AI assistants more transparent and fair, and thus, AIOLIA said on its website, “contribute to responsible digitization.” 

“Hopefully, the practical outcome of our result is that now, with some media attention, more people would use LLM ‘advice’ with a grain of salt, especially when we are talking about complex social mechanisms like professional career choices or personal relationships,” Yamshchikov said.

More on AI bias:

Open source AI hiring models are weighted toward male candidates, study finds

AI culture war: Hidden bias in training models may push political propaganda

Assessing the business risk of AI bias

AI deep fakes, mistakes, and biases may be unavoidable, but controllable
