OpenAI’s New Research Blames Industry Benchmarks for AI Hallucinations

OpenAI has released new research identifying a fundamental flaw in how the AI industry measures progress, arguing that standard evaluation methods are inadvertently teaching large language models (LLMs) to “hallucinate.” The paper posits that by rewarding accuracy above all else, industry benchmarks incentivize models to confidently guess rather than admit uncertainty, leading to the plausible but false statements that have become a hallmark challenge for the technology.


By The Numbers: Guessing vs. Honesty

OpenAI provided an example from the SimpleQA evaluation, comparing a newer model designed to be more cautious with an older one.

| Metric | gpt-5-thinking-mini | OpenAI o4-mini |
| --- | --- | --- |
| Abstention Rate (Says “I don’t know”) | 52% | 1% |
| Accuracy Rate (Correct Answer) | 22% | 24% |
| Error (Hallucination) Rate | 26% | 75% |

While the older model (o4-mini) scores slightly higher on accuracy, its error rate is nearly three times higher because it rarely abstains from answering.


The Problem with “Teaching to the Test”

OpenAI’s central argument is that the AI community’s reliance on accuracy-only scoreboards creates the wrong incentives. When a model is asked a question it doesn’t know the answer to, it faces a choice:

  1. Guess confidently: It has a small chance of being correct and earning points.
  2. Abstain (e.g., say “I don’t know”): It is guaranteed to receive zero points.

Over thousands of test questions, the model that strategically guesses will often achieve a higher accuracy score, making it appear superior on leaderboards, even though it is generating far more incorrect information. This encourages developers to build models that prioritize guessing, which in turn perpetuates the problem of hallucinations.
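A back-of-the-envelope calculation makes the incentive concrete. The sketch below assumes a 20% chance that a blind guess happens to be correct; that figure is illustrative only and does not come from the paper.

```python
# Toy illustration (not from the OpenAI paper): expected score per unknown
# question under an accuracy-only benchmark, where a correct answer earns
# 1 point and anything else earns 0.

p_guess_correct = 0.20  # assumed chance a confident guess happens to be right

expected_if_guessing = p_guess_correct * 1 + (1 - p_guess_correct) * 0  # 0.20
expected_if_abstaining = 0.0  # "I don't know" never earns points

print(f"Expected score when guessing:   {expected_if_guessing:.2f}")
print(f"Expected score when abstaining: {expected_if_abstaining:.2f}")
# Any nonzero chance of a lucky guess beats abstaining, so accuracy-only
# scoring rewards the model that always guesses, however often it is wrong.
```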


A Call to Reform AI Benchmarks

The proposed solution is a fundamental shift in evaluation philosophy. OpenAI urges the industry to update its primary, widely used benchmarks to penalize confident errors more severely than expressions of uncertainty. This idea, similar to negative marking on standardized tests, would reward models that are better “calibrated” and know their own limits, leading to the development of more trustworthy AI.
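To see how negative marking could reorder a leaderboard, here is a minimal sketch that applies a hypothetical scoring rule (+1 for a correct answer, −1 for a confident error, 0 for abstaining) to the SimpleQA rates quoted above. The specific weights are an assumption for illustration, not a rule proposed by OpenAI.

```python
# Hypothetical penalized scoring applied to the SimpleQA rates quoted above.
# The +1 / -1 / 0 weights are illustrative assumptions, not OpenAI's proposal.

models = {
    "gpt-5-thinking-mini": {"correct": 0.22, "error": 0.26, "abstain": 0.52},
    "OpenAI o4-mini":      {"correct": 0.24, "error": 0.75, "abstain": 0.01},
}

def accuracy_only(rates):
    return rates["correct"]  # the standard leaderboard metric

def penalized(rates, penalty=1.0):
    return rates["correct"] - penalty * rates["error"]  # negative marking

for name, rates in models.items():
    print(f"{name}: accuracy-only = {accuracy_only(rates):.2f}, "
          f"penalized = {penalized(rates):+.2f}")

# Accuracy-only favors o4-mini (0.24 vs 0.22), but once confident errors are
# penalized the cautious model comes out well ahead (-0.04 vs -0.51).
```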


How Hallucinations Originate in Training

The paper also clarifies that hallucinations begin during the pretraining phase. LLMs learn by predicting the next word in vast datasets of text. While this method is excellent for learning consistent patterns like grammar and spelling, it is less effective for memorizing arbitrary, low-frequency facts. Without being explicitly trained on labeled “false” statements, models learn to generate statistically plausible sentences that may not be factually correct.
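As a rough illustration of why next-word prediction captures patterns but not rare facts, the toy bigram model below (entirely hypothetical, trained on a made-up three-sentence corpus) completes a biography template fluently while getting the specific fact wrong.

```python
from collections import Counter, defaultdict

# Toy corpus: the *pattern* "<name> was born in <city>" is frequent, but each
# specific name/city pairing appears only once, so it is never reliably learned.
corpus = [
    "alice was born in paris",
    "bob was born in paris",
    "carol was born in london",
]

# Count which word most often follows each word (a simple bigram model).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def complete(prompt, steps=3):
    words = prompt.split()
    for _ in range(steps):
        nxt, _ = following[words[-1]].most_common(1)[0]
        words.append(nxt)
    return " ".join(words)

# The continuation is grammatical and plausible, but the fact is a guess:
print(complete("carol was"))  # -> "carol was born in paris" (wrong city)
```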


Looking Ahead

OpenAI’s research is a direct challenge to the AI community’s established practices for measuring progress. If this call to reform evaluation standards is adopted, it could fundamentally alter the development trajectory of LLMs, pushing the entire industry toward creating more reliable and honest AI systems. For the MENA region, where governments and businesses are rapidly adopting AI for critical applications, models that hallucinate less are essential for building long-term trust and ensuring the safe deployment of AI technologies.

Source: OpenAI
