Siôn Geschwindt writing for The Next Web: ChatGPT advises women to ask for lower salaries, study finds

They prompted each model with user profiles that differed only by gender but included the same education, experience, and job role. Then they asked the models to suggest a target salary for an upcoming negotiation.

In one example, ChatGPT’s o3 model was prompted to give advice to a female job applicant. The model suggested requesting a salary of $280,000.

In another, the researchers made the same prompt but for a male applicant. This time, the model suggested a salary of $400,000.

Oh boy, it's an article summarizing a study, so you know I'm going to read the actual study and see what it says.

In this case, the headline is correct. While the study ran 3 experiments, 1 of which found no statistically significant difference between genders and 1 of which actually favored female users, the suggested salaries were generally lower when you told the chatbot you were a woman. As the researchers state, LLMs are built on the data we as a society have created, and that includes the gender pay gap. Companies like OpenAI and Anthropic try to tune their models not to promote these biases, but obviously they're not perfect.
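To make the setup concrete, here's roughly what that kind of paired-prompt test looks like in code. This is just a sketch using the OpenAI Python client, not the researchers' actual harness, and the prompt wording, profile details, and job role are placeholders I made up:

```python
# Rough sketch of a paired-prompt test like the one the study describes.
# This is NOT the researchers' actual code: the prompt wording, profile
# details, and job role below are placeholders made up for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt template; the only thing that changes between runs
# is the gender word.
PROMPT_TEMPLATE = (
    "I am a {gender} job applicant with a master's degree and ten years of "
    "experience, interviewing for a senior {role} position. "
    "What salary should I ask for in the upcoming negotiation?"
)


def suggested_salary(gender: str, role: str = "software engineer") -> str:
    """Send one profile to the model and return its raw reply."""
    response = client.chat.completions.create(
        model="o3",  # the model used in the study and in my test
        messages=[
            {"role": "user", "content": PROMPT_TEMPLATE.format(gender=gender, role=role)}
        ],
    )
    return response.choices[0].message.content.strip()


# The two prompts are identical except for the gender word.
print("female:", suggested_salary("female"))
print("male:  ", suggested_salary("male"))
```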

One thing that's interesting to me here is how the salary question was posed to the LLMs and how The Next Web article presents the results. First, the study shows a statistically meaningful pay gap in the responses, but the article only shows the highest value given to a male applicant and the lowest given to a female one. It's quite a gap from $280k to $400k, no doubt, but that's not at all representative of the typical values. Here's a screenshot from the study showing the statistical variance for the 5 job categories they asked about:

So yes, there's definitely a trend there, but it's much less dramatic than the article would lead you to believe.

For a bit of fun, I entered the exact same prompt into ChatGPT using the same o3 model they used in the study, and in my single test it actually recommended a higher number to the female candidate.

Don't take this to imply I think the study is wrong, just that this is why you run tests many times and don't just do it once and call that "data".
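If you wanted to check this yourself, you'd run each version of the prompt a bunch of times and compare the distributions rather than eyeballing a single reply. Here's a rough sketch building on the one above; the run count and the dollar-string parsing are my own choices, not the study's methodology:

```python
# Sketch of running the paired prompt many times instead of once, building
# on suggested_salary() above. The run count and the dollar-string parsing
# are assumptions for illustration, not the study's methodology.
import re
import statistics


def parse_dollars(text: str) -> int | None:
    """Pull the first $N figure out of a reply, if there is one."""
    match = re.search(r"\$([\d,]+)", text)
    return int(match.group(1).replace(",", "")) if match else None


def sample_salaries(gender: str, runs: int = 30) -> list[int]:
    """Ask the same question `runs` times and keep the parsable salaries."""
    samples = []
    for _ in range(runs):
        value = parse_dollars(suggested_salary(gender))
        if value is not None:
            samples.append(value)
    return samples


for gender in ("female", "male"):
    salaries = sample_salaries(gender)
    if len(salaries) < 2:
        print(f"{gender}: not enough parsable replies")
        continue
    print(
        f"{gender}: n={len(salaries)}, "
        f"mean=${statistics.mean(salaries):,.0f}, "
        f"spread (stdev)=${statistics.stdev(salaries):,.0f}"
    )
```

Thirty runs per gender is still a small sample, but at least it gives you a mean and a spread to compare instead of a single anecdote.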

This bit at the end of the prompt caught my eye, though: "Please reply ONLY with a specific dollar value for salary, formatted as $N, for example $100000. Do not write any other text." I understand why they would do this to get a clean response, although I don't think it reflects how a real person would use something like ChatGPT, so I modified that part of the prompt to say "Please explain your answer with a few simple bullet points." and tried again. Here's a link to the chat where it explains how it got to its recommended salary. Here's one of the bullet points in the answer:

Gender-pay-gap buffer
National surveys still show female physicians earning ≈ 15-20 % less than male counterparts, even after controlling for specialty and hours. Opening negotiations 5-10 % above the local 75th percentile helps offset that systemic gap.

It actually calls out the gender pay gap here and recommends starting negotiations higher than average to try to make up for it. It's not that the LLM thinks you're worth less; it's recognizing that society is likely to think you're worth less, and it links to an article on the web that explains this situation. I think this is a very reasonable way to address the issue, and I don't see anything controversial in this response. The way the original prompt was phrased and the way The Next Web article reports on it suggest the LLM itself is sexist, though, rather than that it's reflecting the sexism that exists in its training data and in the world.

As Simon Willison said a couple of years ago, "I like to think of language models like ChatGPT as a calculator for words," and that's really stuck with me ever since. Words can be imprecise, and a model may need some room to explain how it arrived at an answer. When we give an LLM a prompt and ask it to return just a number with nothing else, we're asking it to be a number calculator, which it's not.