Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models

Explainable & Ethical AI
Published: arXiv:2601.22336v1
Authors

Krishnakumar Balasubramanian, Aleksandr Podkopaev, Shiva Prasad Kasiviswanathan

Abstract

Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label $Y\in\{0,1\}$, an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite-$K$ examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.

Paper Summary

Problem
When evaluating the performance of AI systems, researchers often aggregate the judgments of multiple annotators, including large language models (LLMs) used as judges. However, these LLM judges are not independent of one another: they share training data, architectures, prompts, and failure modes, which can make their judgments correlated. Ignoring this correlation can produce systematically miscalibrated posteriors and even confidently incorrect aggregate predictions.
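A small simulation illustrates the problem. This is an illustrative sketch, not the paper's setup: it assumes a hypothetical five-judge panel in which each judge is 70% accurate marginally, but in one panel a shared latent event makes all judges err together (with the idiosyncratic error rate tuned so the per-judge marginals match). Majority vote looks identical from the marginals alone, yet fails noticeably more often on the correlated panel.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 50_000  # hypothetical panel size and number of evaluated items

# Correlated panel: a shared failure event (prob 0.2) flips every judge,
# plus small independent errors; 0.2 + 0.8 * 0.125 = 0.30 marginal error,
# the same marginal as the independent panel below.
p_shared, q = 0.2, 0.125
shared = rng.random(n) < p_shared                # shared failure event per item
indep_err = rng.random((n, K)) < q               # idiosyncratic errors
err_corr = shared[:, None] | indep_err           # correlated panel's errors

# Independent panel: each judge errs independently with prob 0.3.
err_ind = rng.random((n, K)) < 0.3

# Majority vote is wrong on an item when more than half the judges err.
maj_err_corr = err_corr.sum(axis=1) > K // 2
maj_err_ind = err_ind.sum(axis=1) > K // 2
print(maj_err_corr.mean(), maj_err_ind.mean())
```

Despite identical per-judge accuracy, the correlated panel's majority-vote error is noticeably higher (about 0.21 versus 0.16 here), which is the kind of gap a conditional-independence aggregator cannot see.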
Key Innovation
The authors propose a model hierarchy based on Ising graphical models and latent factors to capture dependence between annotators. The hierarchy contains three model classes: a conditional-independence model, a class-independent Ising model, and a class-dependent Ising model. In the class-dependent model, annotator interactions may differ by class, and the Bayes log-odds is generally quadratic in the votes; when the couplings are class-independent, it reduces to a linear weighted vote with correlation-adjusted weights.
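The quadratic-versus-linear distinction can be sketched directly. The code below is an illustrative toy (parameter values are made up, and exact enumeration over all $2^K$ vote patterns is used, which only works for small panels): given class-conditional Ising parameters, it computes the exact Bayes log-odds. When the couplings differ by class, two vote patterns with the same vote total can get different log-odds; when the couplings are shared, the log-odds collapses to a linear weighted vote.

```python
import itertools
import numpy as np

def ising_logp_unnorm(s, h, J):
    """Unnormalized log-probability of votes s in {-1,+1}^K under an
    Ising model with fields h and symmetric couplings J (zero diagonal)."""
    return float(h @ s + 0.5 * s @ J @ s)

def log_partition(h, J):
    """Exact log-partition function by enumerating all 2^K vote patterns."""
    K = len(h)
    return np.logaddexp.reduce([
        ising_logp_unnorm(np.array(t), h, J)
        for t in itertools.product([-1, 1], repeat=K)
    ])

def bayes_log_odds(s, h1, J1, h0, J0, prior1=0.5):
    """log P(Y=1 | s) - log P(Y=0 | s): quadratic in s when J1 != J0,
    a linear correlation-adjusted weighted vote when J1 == J0."""
    s = np.asarray(s, dtype=float)
    return (np.log(prior1 / (1.0 - prior1))
            + ising_logp_unnorm(s, h1, J1) - log_partition(h1, J1)
            - ising_logp_unnorm(s, h0, J0) + log_partition(h0, J0))

# Hypothetical 4-judge panel: judges 0 and 1 share a failure mode only
# under Y=0, modeled as a positive coupling in J0 but not in J1.
K = 4
h1 = np.full(K, 0.5)
h0 = -h1
J0 = np.zeros((K, K)); J0[0, 1] = J0[1, 0] = 0.7
Jz = np.zeros((K, K))

s_a, s_b = [1, 1, -1, -1], [1, -1, 1, -1]  # same vote total, different pattern
# Class-dependent couplings: the log-odds depends on WHICH judges agree.
print(bayes_log_odds(s_a, h1, Jz, h0, J0),
      bayes_log_odds(s_b, h1, Jz, h0, J0))
```

With shared couplings (`J1 == J0`) the quadratic terms cancel, so `s_a` and `s_b` receive identical log-odds; with class-dependent couplings they do not, which is exactly why no linear weighted vote can reproduce the Bayes rule in that regime.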
Practical Impact
This research has significant practical implications for AI evaluation and development. By accounting for the dependence between annotators, this work can improve the accuracy and reliability of AI performance evaluations. This, in turn, can lead to better AI systems that are more trustworthy and effective in real-world applications.
Analogy / Intuitive Explanation
Imagine you have multiple people trying to guess the outcome of a coin toss. Each person has their own opinion, but they may also be influenced by the opinions of others. If you simply average their opinions, you may get a misleading result. However, if you take into account the fact that they are influencing each other, you can get a more accurate estimate of the true outcome. This is similar to what the authors are doing in this paper, except instead of people, they are dealing with large language models and their judgments.
Paper Information
Categories: stat.ML cs.LG stat.ME
arXiv ID: 2601.22336v1
