Oh boy, Meta released their latest Llama models this weekend (I guess because they leaked, hence the rushed weekend release?), and they benchmarked quite well on LMArena. However, as Kyle Wiggers reported for TechCrunch, not all may have been on the up and up: Meta’s Benchmarks for Its New AI Models Are a Bit Misleading

One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test that has human raters compare the outputs of models and choose which they prefer. But it seems the version of Maverick that Meta deployed to LM Arena differs from the version that’s widely available to developers.

Meta didn’t like this coverage, so Meta exec Ahmad Al-Dahle denied it:

A Meta exec on Monday denied a rumor that the company trained its new AI models to present well on specific benchmarks while concealing the models’ weaknesses.

A reasonable question might be what the heck LMArena even is. In short, it’s a site where people run Pepsi challenges for chatbots: you put a prompt into a box, two unnamed LLMs give you replies, and you choose which one you like better. It’s pretty simple data, and it’s interesting, but the Pepsi challenge analogy is intentional. The Pepsi Challenge famously always had Pepsi coming out on top, even though more people in general preferred Coke. Pepsi won its challenge because it gave testers a single sip of each drink and asked which they liked better. When people are given a single sip, they tend to prefer the sweeter one, and Pepsi had more sugar, so over one sip it got more votes. Think of this like MKBHD’s smartphone camera tests, where the brighter photo wins out almost every time, or TVs at a store with the contrast cranked up.

Likewise, in these head-to-head comparisons, there seem to be specific things that get people to vote for one response over another, and one of them is how conversational the chatbot seems. It appears that Meta uploaded a version of their Llama 4 model that was optimized for conversation, and that’s different from the normal model they released to everyone else.
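For the curious, here’s a rough sketch of how an arena-style leaderboard might turn those head-to-head votes into a ranking. This is just an illustrative Elo-style update in Python, not LMArena’s actual methodology, and the model names, ratings, and K-factor are all made up:

```python
# Illustrative Elo-style rating update driven by pairwise votes.
# A sketch of the general idea behind arena leaderboards, not LMArena's
# actual scoring method; the models and numbers here are hypothetical.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Move the winner's rating up and the loser's down based on one vote."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

# Hypothetical votes: each tuple is (winner, loser) from one blind comparison.
ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The point of the sketch is that the ranking is purely a function of which reply people happen to prefer in a single blind comparison, which is exactly why a chattier, more flattering variant can climb the board.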

Here’s a statement from LMArena on the matter:

Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.

The simple fact is that if a benchmark gets popular, companies are going to try to game it to look better. It’s happened so much with smartphone benchmarking apps that I easily found a top 7 list! Anyway, I find LLMs really hard to quantify, and I find the academic benchmarks completely uninteresting. You could show me a million reasons why Gemini 2.5 is better than Claude 3.7 and I would just say, “I don’t know, I just find Claude better and Gemini just doesn’t click.”