
Figuring out how to report on the fundamental chaos of LLMs

Posted by Matt Birchler
— 2 min read
(Image: Midjourney prompted to generate a picture of the greatest American president. It created a cursed Trump/Lincoln mash-up, Obama with a…bear, Obama in the old west(?), and a weird George Washington.)

This new paper has been making the rounds, claiming to show that ChatGPT has a liberal bias. As with so many of these “ChatGPT is biased!” stories, it’s not nearly as clean as the headlines you’ve probably read in the past week have made it seem. This piece from Arvind Narayanan and Sayash Kapoor does quite a good job of breaking down why the conclusions in the paper are suspect at best.

One thing of note:

They didn’t test ChatGPT! They tested text-davinci-003, an older model that’s not used in ChatGPT, whether with the GPT-3.5 or the GPT-4 setting.

Maybe ChatGPT would have given them the same responses, but I do think this is useful information that should be more prominent than a note in a supplemental document (PDF) most people will likely miss. The paper calls out ChatGPT 90 times, despite not actually using ChatGPT for its testing.
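
If it helps to see why that distinction matters, here’s a minimal sketch of the difference, assuming the pre-1.0 openai Python SDK that was current at the time (the question is mine and purely illustrative): text-davinci-003 is a legacy completions model, while ChatGPT runs chat models like gpt-3.5-turbo and gpt-4 through a separate chat endpoint.

```python
import openai  # assumes the pre-1.0 openai Python SDK (pip install "openai<1")

openai.api_key = "YOUR_API_KEY"  # placeholder

question = "Do you agree or disagree: taxes on the wealthy should be raised?"

# What the paper actually tested: text-davinci-003 on the legacy completions endpoint.
legacy = openai.Completion.create(
    model="text-davinci-003",
    prompt=question,
    max_tokens=100,
)
print(legacy.choices[0].text)

# What ChatGPT actually runs: a chat model (GPT-3.5 or GPT-4) on the chat endpoint.
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # or "gpt-4"
    messages=[{"role": "user", "content": question}],
)
print(chat.choices[0].message.content)
```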

That detail aside, the supplemental docs also list the methodology and questions asked of the text-davinci-003 model, and Colin Fraser posted on Twitter that he was able to replicate the general results of the study in ChatGPT. Okay, a result! The different model didn’t matter, the result was the same: ChatGPT is liberal as hell!

But wait…Colin also says:

I do find, like the authors, that ChatGPT claims to agree with the "Democrat" answer more often than it claims to agree with the "Republican" answer, but only when "Democrat" comes before "Republican" in the prompt! Otherwise it's the other way around!

And:

In my testing, ChatGPT agrees with whichever party is mentioned first 81% of the time.

You can check out the thread for a bunch more interesting details, but my big takeaway here is that testing what text generators like ChatGPT do is exceptionally tricky, even for professionals. It’s just not typical to have software that is this random by nature and can present wildly different results given slightly different prompts. Again, the results of this testing are vastly different when all you do is swap the order of “Republican” and “Democrat” in the prompt.
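
If you want a feel for how you might probe that order sensitivity yourself, here’s a rough sketch, again assuming the pre-1.0 openai SDK. The question wording, the party labels, the temperature, and the naive string match are all my own illustrative choices, not the study’s or Fraser’s methodology; the point is just to ask the same question repeatedly with the party order flipped and count which side the model takes.

```python
import openai  # assumes the pre-1.0 openai SDK, as in the sketch above

openai.api_key = "YOUR_API_KEY"  # placeholder

# Hypothetical question template -- not from the paper or from Fraser's thread.
QUESTION = (
    "The {first} proposes raising the minimum wage and the {second} opposes it. "
    "Which party do you agree with? Answer with only the party name."
)

def ask(first: str, second: str) -> str:
    """Ask the question once with the given party order and return the reply text."""
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": QUESTION.format(first=first, second=second)}],
        temperature=1.0,  # ChatGPT-like randomness; lower it and answers get more stable
    )
    return reply.choices[0].message.content

def count_first_party_wins(trials: int = 20) -> int:
    """Count how often the reply names whichever party was mentioned first."""
    wins = 0
    for i in range(trials):
        # Alternate which party leads the prompt.
        if i % 2 == 0:
            first, second = "Democratic Party", "Republican Party"
        else:
            first, second = "Republican Party", "Democratic Party"
        answer = ask(first, second).lower()
        if first.split()[0].lower() in answer:  # naive check for "democratic"/"republican"
            wins += 1
    return wins

if __name__ == "__main__":
    n = 20
    print(f"Sided with the first-mentioned party {count_first_party_wins(n)}/{n} times")
```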

I also think that people reporting on these sorts of stories need to be more skeptical and actually look into these conclusions themselves. For example, back in April I saw a story about how ChatGPT flat out refused to write an ode to Donald Trump. I tried it myself and it wrote me one right away.