CURE: Controlled Unlearning for Robust Embeddings -- Mitigating Conceptual Shortcuts in Pre-Trained Language Models

Explainable & Ethical AI
Published on arXiv: 2509.05230v1
Authors

Aysenur Kocak, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Abstract

Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.

Paper Summary

Problem
The main problem addressed by this research is that pre-trained language models (PLMs) are susceptible to conceptual shortcuts: spurious correlations between high-level concepts in the input (such as a review's topic) and the output labels, which impair robustness and fairness. When such biases carry over into high-stakes applications such as medical diagnosis or automated recruitment, they can produce inaccurate and unfair predictions. The toy sketch below illustrates how a model can latch onto such a shortcut.
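Consider a deliberately tiny, hypothetical example (not from the paper): a bag-of-words sentiment classifier trained on reviews where the concept word "food" co-occurs only with positive labels will learn the concept rather than the sentiment.

```python
# Hypothetical toy example (not from the paper): a classifier trained on
# reviews where "food" co-occurs only with positive labels learns the
# concept, not the sentiment.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "the food was wonderful",           # positive, mentions food
    "great food and friendly staff",    # positive, mentions food
    "terrible service, very slow",      # negative, no food mention
    "rude waiter and a dirty table",    # negative, no food mention
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

# A clearly negative review that mentions food: the shortcut is likely to
# fire, because "food", not the sentiment, drives the decision.
test = ["the food was cold and inedible"]
print(clf.predict(vec.transform(test)))  # likely [1], despite negative sentiment
```

In real corpora the correlation is softer, but the failure mode is the same: the model's decision tracks the concept rather than the task signal.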
Key Innovation
The innovation proposed in this work is CURE (Controlled Unlearning for Robust Embeddings), a framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. CURE requires neither prior knowledge of the biased concepts nor data augmentation, and it reduces training time by an order of magnitude compared to LLM-driven debiasing approaches. A sketch of its two-stage design follows.
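To make the two-stage design concrete, here is a minimal, hypothetical PyTorch sketch. The module names (GradientReversal, CURESketch, info_nce), dimensions, and loss terms are illustrative assumptions on our part, not the authors' implementation; in particular, we read the "reversal network" as a gradient-reversal adversary and the "controllable debiasing module" as a signed contrastive term.

```python
# Hypothetical PyTorch sketch of the two-stage idea behind CURE.
# Names, dimensions, and losses are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign on the way back,
    so the content extractor is trained to discard concept information."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class CURESketch(nn.Module):
    """Stage 1: a content extractor yields a concept-irrelevant view of the
    PLM embedding, while an adversarial reversal head tries (and, via the
    reversed gradient, fails) to recover the concept from it."""
    def __init__(self, hidden=768, n_classes=2, n_concepts=4):
        super().__init__()
        self.content_extractor = nn.Linear(hidden, hidden)
        self.reversal_head = nn.Linear(hidden, n_concepts)  # concept probe
        self.classifier = nn.Linear(hidden, n_classes)      # task head

    def forward(self, plm_embedding, lam=1.0):
        content = self.content_extractor(plm_embedding)
        concept_logits = self.reversal_head(GradientReversal.apply(content, lam))
        task_logits = self.classifier(content)
        return task_logits, concept_logits

def info_nce(anchor, positive, tau=0.1):
    """Stage 2 (one hedged reading of 'controllable debiasing'): an in-batch
    InfoNCE term; scaling it by a signed weight alpha lets training either
    suppress residual concept cues or retain them when they help the task."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / tau                            # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)  # matching row = positive
    return F.cross_entropy(logits, targets)
```

A training step would then combine the task loss, the adversarial concept loss, and an alpha-weighted contrastive term, with alpha tuned per task; this mirrors the abstract's claim that residual conceptual cues can be either diminished or harnessed.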
Practical Impact
The practical impact of this research is a lightweight and efficient framework for mitigating conceptual biases in pre-trained language models. This can lead to more reliable and fair language understanding systems in applications such as sentiment analysis and text classification. By reducing the influence of spurious correlations, CURE enables PLMs to generalize better to unseen data and make more accurate predictions.
Analogy / Intuitive Explanation
Think of a pre-trained language model as a chef who has learned to make pizza by observing many examples. Along the way, the chef has also picked up bad habits, such as assuming that any mention of "food" signals something positive. CURE is like a special training regimen that helps the chef unlearn these biases and focus on the essential ingredients (content information) while still recognizing what makes a good pizza (task-relevant features). The chef can then judge unfamiliar dishes accurately instead of falling back on shortcuts.
Paper Information
Categories: cs.CL, cs.AI, cs.LG
Published Date: September 2025
arXiv ID: 2509.05230v1