Newer LLMs are getting worse at some things
Joe Wilkins, writing for Futurism: "AI Chatbots Are Becoming Even Worse at Summarizing Data"
Alarmingly, the LLMs' rate of error was found to increase the newer the chatbot was – the exact opposite of what AI industry leaders have been promising us. This is in addition to a correlation between an LLM's tendency to overgeneralize with how widely used it is, "posing a significant risk of large-scale misinterpretations of research findings," according to the study's authors.
Yesterday I posted about how we don't really understand why (not "how", but "why") LLMs work as well as they do, and this study showing that the more advanced the chatbot gets, the more likely it is to make certain errors around data summarization is fascinating. In early 2023 AI boosters were like "if we see LLMs get better at this rate forever, they're going to be insanely smart in no time!" Yet, since GPT-4's release 2 years ago, it really feels like the advances are pretty minor. Sure, the tooling is getting better, but most of these gains seem to me to be us finding more effective ways to use the models, not the models themselves getting meaningfully better. Even the "thinking" models we've seen post-DeepSeek R1 have largely been customized models tuned to use chain-of-thought responses, which we've found helps them get to better answers.
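To make that chain-of-thought point concrete, here's a minimal sketch of what the technique looks like from the prompting side. The `call_llm` helper is a hypothetical stand-in for whatever model client you actually use; the point is that the "thinking" behavior is elicited by the prompt, not by any special machinery in the model call.

```python
# A minimal sketch of chain-of-thought prompting. `call_llm` is a
# hypothetical placeholder, not any vendor's real API.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: swap in your actual model client here."""
    return f"<model reply to: {prompt[:40]}...>"

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 "
    "more than the ball. How much does the ball cost?"
)

# Direct prompt: ask for the answer in one shot.
direct_answer = call_llm(question)

# Chain-of-thought prompt: same model, but the prompt asks it to lay
# out its reasoning step by step before committing to a final answer.
# Empirically this tends to help on multi-step problems.
cot_answer = call_llm(
    question
    + "\n\nThink through this step by step, "
    "then state the final answer on its own line."
)

print(direct_answer)
print(cot_answer)
```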
I'm not denying that models are improving, but when I look at my own experience and stories like this, it really feels like we had a massive breakthrough in late 2022 that led to ChatGPT's historic launch, we got some meaningful improvements in the 6 months after that, and ever since then the model gains have been marginal. I'm sure an expert could tell me all the ways the new models are better, but I just don't feel it from the user side.
I guess what I'm getting at is I'm still not convinced that LLMs are on an unstoppable march upwards. Maybe something will change, we'll understand "why" they work or land on some other major insight that enables real breakthroughs, but for now it feels to me like we're finding new and interesting ways to use models at a given performance level to enhance their outputs, rather than seeing the models themselves approach anything remotely close to "AI". And honestly, LLMs are still a massively impactful technology, even if they never get much better for day-to-day things than they are today.