Unidentified and Confounded? Understanding Two-Tower Models for Unbiased Learning to Rank (Extended Abstract)

Explainable & Ethical AI
Published: arXiv 2508.21698v1
Authors

Philipp Hager, Onno Zoeter, Maarten de Rijke

Abstract

Additive two-tower models are popular learning-to-rank methods for handling biased user feedback in industry settings. Recent studies, however, report a concerning phenomenon: training two-tower models on clicks collected by well-performing production systems leads to decreased ranking performance. This paper investigates two recent explanations for this observation: confounding effects from logging policies and model identifiability issues. We theoretically analyze the identifiability conditions of two-tower models, showing that either document swaps across positions or overlapping feature distributions are required to recover model parameters from clicks. We also investigate the effect of logging policies on two-tower models, finding that they introduce no bias when models perfectly capture user behavior. However, logging policies can amplify biases when models imperfectly capture user behavior, particularly when prediction errors correlate with document placement across positions. We propose a sample weighting technique to mitigate these effects and provide actionable insights for researchers and practitioners using two-tower models.
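To make the model class concrete, below is a minimal sketch of an additive two-tower click model in PyTorch: a relevance tower scores a document from its features, a bias (examination) tower scores the display position, and the two outputs are added in logit space before a binary cross-entropy loss on logged clicks. The layer sizes, feature dimensions, and synthetic data are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AdditiveTwoTower(nn.Module):
    """Minimal additive two-tower click model: a relevance tower on
    document features plus a bias (examination) tower on the display
    position, combined additively in logit space."""

    def __init__(self, num_doc_features: int, num_positions: int, hidden: int = 32):
        super().__init__()
        # Relevance tower: scores a document from its feature vector.
        self.relevance_tower = nn.Sequential(
            nn.Linear(num_doc_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # Bias tower: one learned logit per display position.
        self.bias_tower = nn.Embedding(num_positions, 1)

    def forward(self, doc_features: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # Additive combination in logit space; a sigmoid would give P(click).
        return (self.relevance_tower(doc_features).squeeze(-1)
                + self.bias_tower(positions).squeeze(-1))

# One training step on logged clicks (shapes and data are illustrative).
model = AdditiveTwoTower(num_doc_features=16, num_positions=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

doc_features = torch.randn(256, 16)           # logged document feature vectors
positions = torch.randint(0, 10, (256,))      # display position of each impression
clicks = torch.randint(0, 2, (256,)).float()  # observed click labels

optimizer.zero_grad()
loss = loss_fn(model(doc_features, positions), clicks)
loss.backward()
optimizer.step()
```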

Paper Summary

Problem
Two-tower models are widely used in industrial settings to correct for biases in click feedback, but recent reports show that training them on clicks logged by a strong, well-performing production policy can actually degrade ranking performance. This is problematic for companies such as Booking.com that rely on these models to surface relevant items, and it raises the question this paper addresses: why does feedback from a good logging policy hurt these models?
Key Innovation
The key contribution of this paper is an analysis of why two-tower models perform poorly on data collected under strong logging policies. It makes two main findings: (1) two-tower models can be identifiable even without document swaps, provided document feature distributions overlap across ranks; and (2) logging policies introduce no bias when the model perfectly captures user behavior, but they can amplify bias when the model is misspecified. The authors also propose a sample weighting scheme to counteract potential logging-policy influences (sketched below).
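The paper's exact weighting scheme is defined in the full text; the sketch below only illustrates the general mechanism of per-sample loss weighting against logging-policy skew. The specific choice of weights here, inverse frequency of each logged (document, position) placement, is a placeholder assumption, not the authors' formula.

```python
import numpy as np

def placement_weights(doc_ids, positions, num_positions):
    """Placeholder weights: weight each logged (document, position) pair
    inversely to how often the logging policy places that document at that
    position, so over-represented placements do not dominate the click loss."""
    doc_ids = np.asarray(doc_ids)
    positions = np.asarray(positions)
    weights = np.empty(len(doc_ids), dtype=float)
    for d in np.unique(doc_ids):
        mask = doc_ids == d
        counts = np.bincount(positions[mask], minlength=num_positions).astype(float)
        placement_prob = counts[positions[mask]] / counts.sum()
        weights[mask] = 1.0 / np.maximum(placement_prob, 1e-6)
    return weights / weights.mean()  # normalize so the average weight is 1

# Example: document 3 is almost always shown at rank 0 by the logging policy,
# so its rank-0 impressions get downweighted relative to its rank-1 impression.
doc_ids = [3, 3, 3, 7, 7, 7]
positions = [0, 0, 1, 1, 2, 2]
weights = placement_weights(doc_ids, positions, num_positions=3)
print(weights)  # multiply these into the per-sample click loss
```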
Practical Impact
This research has several practical implications. First, practitioners should monitor click residuals for correlations with the logging policy, as a way to detect model misspecification; a simple diagnostic is sketched below. Second, collecting randomized data when feasible ensures overlapping document or feature distributions across positions, which is what identifiability requires. Finally, the authors advise against sorting documents by expert relevance labels in simulation studies, since this creates an unrealistically strong logging policy and reintroduces the very problem under study. Addressing these points helps companies keep two-tower models effective and their recommendations accurate.
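To operationalise the first recommendation, one simple diagnostic is to correlate per-impression click residuals (observed click minus predicted click probability) with the logging policy's scores or logged ranks; a clearly nonzero correlation suggests the model's errors are systematically tied to where the policy placed documents. The function below is one straightforward way to run this check; the names and synthetic data are illustrative assumptions.

```python
import numpy as np

def residual_policy_correlation(clicks, predicted_ctr, policy_scores):
    """Pearson correlation between click residuals and the logging policy's
    scores (or logged ranks). A clearly nonzero value hints that prediction
    errors correlate with document placement, i.e. possible misspecification."""
    residuals = np.asarray(clicks, dtype=float) - np.asarray(predicted_ctr, dtype=float)
    return float(np.corrcoef(residuals, np.asarray(policy_scores, dtype=float))[0, 1])

# Illustrative check on logged data (all arrays aligned per impression).
rng = np.random.default_rng(0)
policy_scores = rng.normal(size=1000)                      # logging policy scores
predicted_ctr = 1 / (1 + np.exp(-rng.normal(size=1000)))   # model's click probabilities
clicks = rng.binomial(1, predicted_ctr)                    # observed clicks

print(f"residual-policy correlation: "
      f"{residual_policy_correlation(clicks, predicted_ctr, policy_scores):.3f}")
```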
Analogy / Intuitive Explanation
Think of a two-tower model as a judge deciding which books to recommend. The judge weighs two things: how good each book's content is and how prominently it was displayed on the shelf. If nearly all of the data comes from a shop that already puts its best books at the front (a strong logging policy), it becomes hard to tell whether a book was picked up because it was good or because it was easy to reach. Two-tower models face the same difficulty with clicks: when relevance and position are entangled by the logging policy, the model can confuse one for the other and its rankings suffer. By understanding and addressing these biases, companies can improve the accuracy of their recommendations and provide a better experience for their users.
Paper Information
Categories: cs.IR
arXiv ID: 2508.21698v1