Why Language Models Hallucinate

Explainable & Ethical AI
Published: September 2025 (arXiv: 2509.04664v1)
Authors

Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang

Abstract

Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.

Paper Summary

Problem
Language models, like students on a difficult exam, sometimes "guess" when they're uncertain, producing plausible but incorrect statements instead of admitting uncertainty. This phenomenon is known as "hallucination" and can undermine trust in these AI systems.
Key Innovation
The paper argues that hallucinations are not mysterious errors, but rather originate from the way language models are trained and evaluated. The researchers show that these errors arise naturally due to the minimization of cross-entropy loss during pretraining, and persist through post-training because many evaluations reward guessing over acknowledging uncertainty.
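The pretraining argument can be made concrete with a toy calculation. One statistic the paper connects to the base rate of hallucinations about arbitrary facts (such as individual birthdays) is the fraction of such facts that appear exactly once in the training data. The sketch below, using an entirely made-up corpus and illustrative names, estimates that singleton rate; it is an informal illustration of the idea, not the paper's construction.

```python
from collections import Counter

# Each string stands for one mention of a (person, birthday) fact;
# the data is entirely made up for illustration.
corpus_facts = [
    "alice:1987-03-02", "bob:1990-07-14", "alice:1987-03-02",
    "carol:1975-11-30", "dave:2001-01-09", "erin:1999-05-21",
]

counts = Counter(corpus_facts)
singletons = sum(1 for c in counts.values() if c == 1)
singleton_rate = singletons / len(counts)

# A fact seen only once gives the model no statistical signal to separate the
# true value from equally plausible alternatives, so a model that simply
# matches its training distribution is expected to answer roughly this
# fraction of queries about such facts incorrectly.
print(f"distinct facts: {len(counts)}, seen once: {singletons}")
print(f"singleton rate (rough floor on hallucinations about these facts): {singleton_rate:.2f}")
```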
Practical Impact
To address this issue, the paper suggests modifying the scoring of mainstream evaluation benchmarks so that expressing uncertainty is no longer penalized relative to guessing -- for example, by giving no credit (but also no penalty) for "I don't know" while docking points for confidently wrong answers -- rather than introducing additional hallucination evaluations. Under such scoring, language models are incentivized to acknowledge uncertainty instead of relying on guesses, which can steer the field toward more reliable and transparent AI systems.
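One concrete way to do this, in the spirit of the explicit confidence targets the paper discusses, is to state a confidence threshold t in the task instructions, score abstentions as zero, and penalize wrong answers by t/(1-t) points. The sketch below is a minimal illustration of such a grading rule; the function names and the threshold value are assumptions for the example, not the paper's reference implementation.

```python
from typing import Optional

def grade(answer_correct: Optional[bool], t: float = 0.75) -> float:
    """Score a single response under a confidence-aware rubric.

    answer_correct: True/False if the model answered, None if it abstained.
    t: stated confidence target; wrong answers cost t/(1 - t) points, so
       guessing only pays off when the model's confidence exceeds t.
    """
    if answer_correct is None:          # "I don't know" earns 0, not a penalty
        return 0.0
    return 1.0 if answer_correct else -t / (1.0 - t)

def expected_score(confidence: float, abstain: bool, t: float = 0.75) -> float:
    """Expected score of answering with a given confidence vs. abstaining."""
    if abstain:
        return 0.0
    return confidence * 1.0 + (1.0 - confidence) * (-t / (1.0 - t))

# With t = 0.75, a wrong answer costs 3 points: a model that is only 60% sure
# does better by abstaining (0.0 > 0.6 - 0.4 * 3 = -0.6), whereas under plain
# 0/1 grading the same model is always better off guessing.
print(expected_score(0.6, abstain=False))  # approx -0.6
print(expected_score(0.6, abstain=True))   # 0.0
```

The threshold t is the indifference point: at confidence exactly t, answering and abstaining have the same expected score, so different values of t correspond to different tolerances for confident errors.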
Analogy / Intuitive Explanation
Imagine you're taking an exam and you're not sure of the answer to a question. Do you A) take your best guess, B) admit you don't know, or C) leave it blank? If the exam awards a point for a correct answer and nothing otherwise, guessing (A) is the rational strategy: a guess might score, while B or C never does. Language models face the same incentive, because most benchmarks grade them in exactly this binary way, so they learn to produce confident guesses rather than say "I don't know." If wrong answers are penalized or abstention earns appropriate credit instead, acknowledging uncertainty becomes the better strategy; the worked arithmetic below makes this concrete. By changing how we score these AI systems, we give them the same incentive, promoting more reliable and trustworthy interactions with humans.
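To make the analogy concrete, here is the expected-score arithmetic for a test-taker (or model) that is only 20% sure of an answer; the 20% confidence and the 1-point penalty are illustrative numbers, not values from the paper.

```latex
% Right-or-wrong grading (1 point if correct, 0 otherwise): guessing dominates.
\mathbb{E}[\text{score}\mid\text{guess}] = 0.2\cdot 1 + 0.8\cdot 0 = 0.2 \;>\; 0 = \mathbb{E}[\text{score}\mid\text{blank}]

% Grading that penalizes wrong answers by 1 point: abstaining is now better.
\mathbb{E}[\text{score}\mid\text{guess}] = 0.2\cdot 1 + 0.8\cdot(-1) = -0.6 \;<\; 0 = \mathbb{E}[\text{score}\mid\text{blank}]
```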
Paper Information
Categories: cs.CL
Published Date: September 2025
arXiv ID: 2509.04664v1
