A Perfectly Truthful Calibration Measure

Explainable & Ethical AI
Published: arXiv: 2508.13100v1
Authors

Jason Hartline, Lunjia Hu, Yifan Wu

Abstract

Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. Calibration measures quantify how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Although predicting the true probabilities guarantees perfect calibration, in reality, when calibration is evaluated on a finite sample, predicting the truth is not guaranteed to minimize any known calibration measure. All known calibration measures incentivize predictors to lie in order to appear more calibrated on a finite sample. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a perfectly truthful calibration measure in the batch setting: averaged two-bin calibration error (ATB). In addition to being truthful, ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal). The simplicity in our definition of ATB makes it efficient and straightforward to compute. ATB allows faster estimation algorithms with significantly easier implementations than smCal and distCal, achieving improved running time and simplicity for the calibration testing problem studied by Hu et al. (2024). We also introduce a general recipe for constructing truthful measures, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned ℓ2-ECE.
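To see why a measure built from squared bin residuals can be truthful, the identity below may help. It is our own illustration under stated assumptions, not the paper's proof or its general recipe: outcomes y_i are independent Bernoulli(p_i), predictions w_i are fixed before the outcomes are drawn, and the bins form a partition of all n points that depends only on the predictions.

```latex
% Assumptions (illustrative): y_i ~ Bernoulli(p_i) independently; predictions
% w_i are fixed before outcomes are drawn; \mathcal{B}(w) is any partition of
% {1, ..., n} determined by the predictions alone.
\mathbb{E}\Big[\sum_{B\in\mathcal{B}(w)}\Big(\sum_{i\in B}(y_i-w_i)\Big)^{2}\Big]
  = \sum_{B\in\mathcal{B}(w)}\operatorname{Var}\Big(\sum_{i\in B}y_i\Big)
  + \sum_{B\in\mathcal{B}(w)}\Big(\sum_{i\in B}(p_i-w_i)\Big)^{2}
  = \sum_{i=1}^{n}p_i(1-p_i)
  + \sum_{B\in\mathcal{B}(w)}\Big(\sum_{i\in B}(p_i-w_i)\Big)^{2}
```

The first term is the same for every partition, so a prediction-dependent binning cannot change it, while the second term is nonnegative and vanishes when w_i = p_i. Dividing by a constant such as n^2 keeps this property. The usual ℓ2-ECE instead divides each bin by its own size |B|, which makes the variance term depend on how the predictions split the points; keeping bin sizes fixed, as quantile binning does, removes that dependence.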

Paper Summary

Problem
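The paper addresses the lack of a truthful calibration measure. A predictor is calibrated when its predictions are conditionally unbiased: among all cases where it predicts probability v, the outcome occurs a v fraction of the time. Calibration measures quantify how far a predictor is from this ideal. However, when calibration is evaluated on a finite sample, every previously known calibration measure can reward a predictor for "lying," that is, for reporting probabilities other than the true ones so that it appears more calibrated on that sample. Such measures can therefore rank a predictor with distorted probabilities above one that reports the ground truth.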
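The lying incentive is easy to reproduce for the common binned expected calibration error (ECE). The sketch below is our own illustration, not code from the paper. It assumes one bin per distinct predicted value and draws ground-truth probabilities uniformly from [0, 1], so an honest predictor puts each point in its own bin, while a predictor that reports the constant 0.5 everywhere has a single bin whose residuals largely cancel. The names `binned_ece` and `simulate` are hypothetical.

```python
import random
from collections import defaultdict


def binned_ece(preds, outcomes):
    """Empirical ECE with one bin per distinct predicted value.

    Equals sum over bins of |sum_{i in bin} (y_i - v_i)| / n, i.e. the
    bin-size-weighted average of |mean outcome - prediction| per bin.
    """
    residual = defaultdict(float)  # predicted value -> summed residual
    for v, y in zip(preds, outcomes):
        residual[v] += y - v
    return sum(abs(r) for r in residual.values()) / len(preds)


def simulate(n=1000, seed=0):
    """Compare an honest predictor with a constant 'lying' predictor."""
    rng = random.Random(seed)
    # Ground-truth probabilities; distinct floats with overwhelming probability.
    truth = [rng.random() for _ in range(n)]
    outcomes = [1 if rng.random() < p else 0 for p in truth]
    honest = binned_ece(truth, outcomes)
    # The liar ignores individual probabilities and reports the population
    # mean (0.5 for Uniform[0, 1]) for every point.
    liar = binned_ece([0.5] * n, outcomes)
    return honest, liar


honest, liar = simulate()
print(f"honest ECE = {honest:.3f}, constant-0.5 ECE = {liar:.3f}")
```

In expectation, the honest predictor's ECE here is E[2p(1-p)] = 1/3, while the constant predictor's ECE shrinks roughly like 1/√n, so this measure rewards throwing away information. Fixed-width binning changes the details, but the abstract reports that no previously known measure is guaranteed to be minimized by the truth.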
Key Innovation
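The key innovation is a perfectly truthful calibration measure for the batch setting, the Averaged Two-Bin Calibration Error (ATB): its expected value on a finite sample is minimized when the predictor outputs the ground-truth probabilities. Previously, only approximately truthful measures were known, and only in the sequential prediction setting. Besides being truthful, ATB is sound, complete, and continuous, and it is quadratically related to two existing measures, the smooth calibration error (smCal) and the lower distance to calibration (distCal). Its simple definition makes it efficient to compute and yields faster, easier-to-implement algorithms for calibration testing. The paper also gives a general recipe for constructing truthful measures, which covers ATB and other measures such as quantile-binned ℓ2-ECE.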
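The abstract stresses that ATB is simple to compute. Its exact definition is not reproduced on this page, so the sketch below assumes one plausible reading of the name: for a threshold θ, split the points into v_i < θ and v_i ≥ θ, take the squared residual sum (y_i - v_i) of each bin divided by n^2, and average over θ drawn uniformly from [0, 1]. The function `atb_sketch` and this normalization are assumptions, not the paper's code or a verified definition. Because the bins only change at sorted prediction values, the average over θ can be computed exactly in O(n log n) time with prefix sums.

```python
from typing import Sequence


def atb_sketch(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Averaged two-bin calibration error under an ASSUMED definition.

    Assumed (not verified against the paper): for a threshold theta, the
    two-bin error is (S_left / n)^2 + (S_right / n)^2, where S_left sums the
    residuals y_i - v_i over predictions v_i < theta and S_right over
    v_i >= theta; ATB averages this over theta ~ Uniform[0, 1].
    """
    n = len(preds)
    if n == 0 or n != len(outcomes):
        raise ValueError("need non-empty inputs of equal length")
    pairs = sorted(zip(preds, outcomes))  # sort points by predicted value
    if pairs[0][0] < 0.0 or pairs[-1][0] > 1.0:
        raise ValueError("predictions must lie in [0, 1]")

    total = sum(y - v for v, y in pairs)  # residual sum over all points
    # Breakpoints: the bins can only change at a sorted prediction value.
    cuts = [0.0] + [v for v, _ in pairs] + [1.0]
    prefix = 0.0  # residual sum of the k smallest predictions
    integral = 0.0
    for k in range(n + 1):
        if k > 0:
            v, y = pairs[k - 1]
            prefix += y - v
        # For theta in (cuts[k], cuts[k + 1]], the left bin is exactly the
        # k smallest predictions; tied predictions give zero-width pieces.
        width = cuts[k + 1] - cuts[k]
        integral += width * (prefix ** 2 + (total - prefix) ** 2)
    return integral / n ** 2


print(atb_sketch([0.2, 0.8], [1, 0]))  # approximately 0.192
```

In the example, thresholds between 0.2 and 0.8 split the points into bins with residual sums 0.8 and -0.8, contributing 0.6 × 1.28 / 4 = 0.192, while other thresholds put both points in one bin where the residuals cancel. Under this assumed form, an honest predictor's expected value is Σ p_i(1 - p_i) / n^2 at every threshold, because the two bins always cover all points.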
Practical Impact
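The practical impact of this research could be significant. Wherever predicted probabilities drive decisions, for example in medical diagnosis, finance, or self-driving cars, a truthful calibration measure lets developers evaluate and compare models without rewarding predictors that distort their probabilities to look better calibrated on a finite test set. ATB does not by itself make a model's predictions reliable, but it removes the incentive to misreport them, and its simplicity makes calibration straightforward to estimate and test in practice.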
Analogy / Intuitive Explanation
Imagine a forecaster who announces the chance of rain each day. The forecaster is calibrated if, on the days it says 80%, it rains on about 80% of them; if it rains on only 60% of those days, the forecasts are miscalibrated. A calibration measure works like a grader that scores how far the forecaster is from calibrated. A truthful grader gives the best expected score to a forecaster who reports its honest probabilities. With every previously known grader, when only a limited number of days is scored, a forecaster can in some situations expect a better score by shading its forecasts away from what it believes. ATB is a grader without this loophole: honest forecasting is always the best strategy in expectation.
In summary, this paper presents a perfectly truthful calibration measure, the Averaged Two-Bin Calibration Error (ATB), which is also simple and efficient to compute, making it a valuable tool for evaluating models in high-stakes applications.
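The weather example comes down to comparing each forecast level with the observed frequency on the days that level was used. A minimal sketch, using a hypothetical helper `reliability_table` and illustrative records that match the 80%/60% example:

```python
from collections import defaultdict


def reliability_table(forecasts, rained):
    """Map each forecast level to (number of days, observed rain frequency)."""
    counts = defaultdict(lambda: [0, 0])  # level -> [days, rainy days]
    for f, r in zip(forecasts, rained):
        counts[f][0] += 1
        counts[f][1] += int(r)
    return {f: (days, rainy / days) for f, (days, rainy) in sorted(counts.items())}


# Illustrative records: ten days forecast at 80%, six of which were rainy.
forecasts = [0.8] * 10
rained = [True] * 6 + [False] * 4
table = reliability_table(forecasts, rained)
days, freq = table[0.8]
print(days, freq, round(freq - 0.8, 2))  # 10 0.6 -0.2
```

A calibrated forecaster would see a frequency close to 0.8 at that level; the gap of -0.2 is the kind of conditional bias that calibration measures aggregate across levels.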
Paper Information
Categories: cs.LG, cs.DS, stat.ML
Published Date:
arXiv ID: 2508.13100v1
