
Frontier models are failing one in three production attempts — and getting harder to audit

AI agents have become integral to enterprise workflows, yet on structured benchmarks they still fail roughly one in every three attempts. Stanford HAI’s ninth annual AI Index report identifies this gap between capability and reliability as the primary operational challenge facing IT leaders in 2026.
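To see why that failure rate matters operationally, consider how it compounds across multi-step agent workflows. Here is a minimal Python sketch; the step counts are illustrative assumptions, not figures from the report:

```python
# Illustrative only: how a ~67% single-attempt success rate (the report's
# "one in three failures") compounds across multi-step agent workflows.
# The step counts are hypothetical, not taken from the AI Index.

per_step_success = 2 / 3  # roughly two in three attempts succeed

for steps in (1, 3, 5, 10):
    workflow_success = per_step_success ** steps
    print(f"{steps:>2} dependent steps -> {workflow_success:.1%} end-to-end success")
```

Under these assumptions, a workflow of ten dependent steps succeeds end to end less than 2% of the time, which is why per-task reliability, not peak capability, dominates production outcomes.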

The report invokes the “jagged frontier,” a term coined by AI researcher Ethan Mollick to describe the uneven boundary along which AI excels at some tasks and abruptly fails at seemingly similar ones. Despite significant model advances in 2025, and with enterprise AI adoption reaching 88%, those failure zones persist.

For example, leading models have posted striking gains on headline benchmarks: top scores on MMLU-Pro now exceed 87%, and scores on Humanity’s Last Exam (HLE) have improved by roughly 30%. Yet the same models still struggle to tell time. On ClockBench, which tests reading analog clocks, frontier systems such as Gemini Deep Think and GPT-4.5 High reached only around 50% accuracy, compared with roughly 90% for humans.
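Headline numbers like these come out of simple accuracy harnesses. The sketch below is a generic illustration with made-up answers; ClockBench’s actual task format and scoring rules are not described here:

```python
# A generic accuracy harness of the kind behind headline benchmark numbers.
# The data is invented for illustration; it is not ClockBench's dataset.

from typing import List, Tuple

def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (expected, predicted) pairs that match exactly."""
    correct = sum(expected == predicted for expected, predicted in pairs)
    return correct / len(pairs)

# Hypothetical clock-reading answers: (ground truth, model output)
results = [("3:45", "3:45"), ("10:10", "10:10"), ("7:23", "7:25"), ("12:05", "1:05")]
print(f"accuracy: {accuracy(results):.0%}")  # 50% on this toy sample
```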

Hallucination rates, multi-step reasoning, and transparency remain persistent problems. The same system can excel at advanced tasks while stumbling on basic perception and reasoning workflows. Opacity is growing, too: developers of leading models increasingly withhold crucial information such as training code and dataset sizes, making both development and evaluation harder to scrutinize.

Benchmarking AI progress has itself become harder, owing to reporting bias, benchmark contamination, and discrepancies between developer-reported results and independent testing. As capability outpaces the benchmarks designed to measure it, the field needs more reliable, comprehensive evaluation methods that reflect real-world performance.
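For a concrete sense of one of these failure modes, a common contamination heuristic flags benchmark items whose word n-grams appear verbatim in the training corpus. The sketch below is illustrative only; it is not the AI Index’s methodology, and the corpus, n-gram length, and example strings are assumptions:

```python
# A minimal sketch of one common contamination heuristic: flag benchmark
# items whose word n-grams appear verbatim in the training corpus.
# Illustrative approach only, not the AI Index's methodology.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, corpus_index: set, n: int = 8) -> bool:
    """True if any n-gram of the item also occurs in the training corpus."""
    return not ngrams(benchmark_item, n).isdisjoint(corpus_index)

# In practice corpus_index would be built offline from terabytes of training
# data; here it is a tiny stand-in.
corpus_index = ngrams("the quick brown fox jumps over the lazy dog near the river bank today")
print(looks_contaminated("quick brown fox jumps over the lazy dog near the river", corpus_index))
```

Real contamination audits are far more involved (deduplication, fuzzy matching, canary strings), but the basic idea of overlap detection is the same.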

The report also raises concerns about data sustainability and scaling: researchers warn of a potential “peak data” scenario in which the pool of high-quality human-generated training data is exhausted. In response, they are exploring hybrid approaches that combine real and synthetic data for model training.
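In its simplest form, such a hybrid pipeline just samples each training example from a real or a synthetic pool at a fixed ratio. The sketch below is a minimal illustration; the 70/30 split and pool contents are assumptions, not recommendations from the report:

```python
# A minimal sketch of the hybrid-data idea: draw each training example from
# a real or synthetic pool with a fixed mixing ratio. The 70/30 split is an
# assumption for illustration, not a figure from the AI Index.

import random

def mixed_batch(real_pool, synthetic_pool, batch_size=8, real_fraction=0.7, seed=0):
    """Sample a batch that mixes real and synthetic examples."""
    rng = random.Random(seed)
    return [
        rng.choice(real_pool) if rng.random() < real_fraction else rng.choice(synthetic_pool)
        for _ in range(batch_size)
    ]

real_pool = [f"real_{i}" for i in range(100)]
synthetic_pool = [f"synthetic_{i}" for i in range(100)]
print(mixed_batch(real_pool, synthetic_pool))
```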

AI capability has surged in recent years, but significant gaps in reliability, transparency, and responsible practice remain. For IT leaders and developers, the task ahead is closing the divide between what AI demonstrates in demos and what it delivers in production. Prioritizing reliability and transparency alongside raw capability is what will keep deployment both effective and safe.
