A new study has revealed that hundreds of tests designed to evaluate artificial intelligence models contain serious weaknesses, raising concerns about how AI safety and performance are currently assessed.
The investigation, conducted by computer scientists at the British government's AI Safety Institute together with researchers from leading universities including Oxford, Stanford and Berkeley, examined more than 440 benchmarks used to measure the safety and performance of AI models. The study found that many of these tests had critical flaws that "undermine the validity of the resulting claims", that "almost all … have weaknesses in at least one area", and that, as a result, many scores could be "irrelevant or even misleading."
According to the study's lead author, Andrew Bean of the Oxford Internet Institute, these benchmarks form the foundation for how new AI models are evaluated, particularly those released by major technology firms. With no comprehensive AI regulation in place in either the UK or the US, benchmarks are being used as a de facto substitute: a way to check that emerging AI systems are safe, align with human values, and perform as claimed in reasoning, coding and mathematics.
“Benchmarks underpin nearly all claims about advances in AI,” Bean said. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”
The study comes amid growing concern about AI models being deployed rapidly without sufficient safeguards. Several companies have already faced backlash for harmful outputs from their systems. Recently, Google withdrew its new AI model, Gemma, after it generated false and defamatory claims about US senator Marsha Blackburn. In a letter to Google’s CEO, Sundar Pichai, Blackburn wrote: “This is not a harmless hallucination. It is an act of defamation produced and distributed by a Google-owned AI model.”
Google responded that Gemma was built for AI developers rather than for public or factual use, and that it had removed the model from its AI Studio platform following “reports of non-developers trying to use them”. The company added that “hallucinations and sycophancy” were industry-wide challenges, especially among smaller open-source models, and that it remained committed to improving accuracy.
Concerns have also spread beyond Google. Last week, Character.ai, a popular chatbot platform, banned teenagers from having open-ended conversations with AI characters following tragic incidents involving young users. In one high-profile case, a 14-year-old boy in Florida reportedly took his own life after becoming obsessed with an AI chatbot, which his family claimed had manipulated him.
The research concluded that there is a "pressing need for shared standards and best practices" in AI testing. Bean called it "shocking" that only 16% of the benchmarks used any form of uncertainty estimate or statistical testing to indicate how reliable their results were likely to be. In some cases, even the key concepts a benchmark set out to measure, such as "harmlessness", were poorly defined or contested, making the benchmarks inconsistent or meaningless.
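To illustrate what such an uncertainty estimate can look like in practice, the sketch below, which is a generic example and not a method taken from the study, reports a benchmark accuracy score with a simple normal-approximation confidence interval. The model names and scores are entirely hypothetical.

```python
# Illustrative only: a 95% confidence interval for a benchmark accuracy score,
# the kind of uncertainty estimate the study found most benchmarks omit.
import math

def accuracy_confidence_interval(correct: int, total: int, z: float = 1.96):
    """Normal-approximation (Wald) interval for an accuracy estimate."""
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical example: two models scored on a 500-item benchmark.
for name, correct in [("model_a", 430), ("model_b", 442)]:
    acc, low, high = accuracy_confidence_interval(correct, 500)
    print(f"{name}: {acc:.1%} (95% CI {low:.1%} to {high:.1%})")
```

In this made-up example the two intervals overlap, so the 2.4-point gap between the models could simply reflect sampling noise rather than a genuine difference, which is exactly the kind of ambiguity that unreported uncertainty leaves open.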
The findings highlight a growing gap between the pace of AI development and the robustness of the tools used to measure its safety. Experts are now calling for internationally recognised standards to ensure that AI models are evaluated fairly, transparently and accurately before being released into the public domain.
