MonoLoss: A Training Objective for Interpretable Monosemantic Representations

Published: arXiv: 2602.12403v1
Authors

Ali Nasiri-Sarvi, Anh Tien Nguyen, Hassan Rivaz, Dimitris Samaras, Mahdi S. Hosseini

Abstract

Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across SAEs trained on CLIP, SigLIP2, and pretrained ViT features, using BatchTopK, TopK, and JumpReLU SAEs, MonoLoss increases MonoScore for most latents. MonoLoss also consistently improves class purity (the fraction of a latent's activating images belonging to its dominant class) across all encoder and SAE combinations, with the largest gain raising baseline purity from 0.152 to 0.723. Used as an auxiliary regularizer during ResNet-50 and CLIP-ViT-B/32 finetuning, MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K and monosemantic activating patterns on standard benchmark datasets. The code is publicly available at https://github.com/AtlasAnalyticsLab/MonoLoss.
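The paper's exact single-pass derivation is not reproduced on this page, but the flavor of such a quadratic-to-linear reduction can be illustrated. An average pairwise similarity, computed naively over all sample pairs, often collapses to a single pass via the identity sum_{i!=j} v_i . v_j = ||sum_i v_i||^2 - sum_i ||v_i||^2. A minimal sketch (function names are mine; this is not the paper's actual MonoScore):

```python
import numpy as np

def pairwise_mean_similarity_naive(emb: np.ndarray) -> float:
    """O(n^2): average dot product over all ordered pairs i != j."""
    n = emb.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += float(emb[i] @ emb[j])
    return total / (n * (n - 1))

def pairwise_mean_similarity_single_pass(emb: np.ndarray) -> float:
    """O(n): the same quantity via
    sum_{i != j} v_i . v_j = ||sum_i v_i||^2 - sum_i ||v_i||^2."""
    n = emb.shape[0]
    s = emb.sum(axis=0)                      # one pass over the data
    return float(s @ s - (emb * emb).sum()) / (n * (n - 1))
```

Because the single-pass form touches each embedding once, the same score that would take hours over a large image set can be evaluated cheaply enough to serve as a training signal, which is the key enabler the abstract describes.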

Paper Summary

Problem
This paper addresses the limited interpretability of pre-trained vision models. These models are widely used to extract features from images, but the process remains largely opaque because of polysemanticity: a single unit or feature responds to multiple, often unrelated concepts. A further obstacle is that existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them too expensive to use during training.
Key Innovation
The key innovation is a new training objective called MonoLoss, designed to encourage monosemantic features, i.e., features that respond to a single, interpretable concept. MonoLoss is a simple, plug-and-play objective that can be added to standard training procedures to improve the interpretability of pre-trained vision models.
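As an intuition for how such an objective might look, the sketch below penalizes latents whose activating samples have scattered semantic embeddings: if all samples that activate a latent point in the same semantic direction, their activation-weighted centroid has unit norm and the penalty vanishes. Every name and the specific weighting here are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def mono_loss(acts: np.ndarray, emb: np.ndarray, eps: float = 1e-8) -> float:
    """Hypothetical monosemanticity-style penalty (illustrative only).

    acts: (n_samples, n_latents) non-negative latent activations
    emb:  (n_samples, d) unit-norm semantic embeddings of the samples
    """
    # Normalize activations into per-latent weights over samples.
    w = acts / (acts.sum(axis=0, keepdims=True) + eps)
    # Activation-weighted mean embedding per latent: (n_latents, d).
    centroids = w.T @ emb
    # A centroid near unit norm means the activating samples agree
    # semantically; minimizing (1 - norm) rewards that consistency.
    consistency = np.linalg.norm(centroids, axis=1)
    return float(1.0 - consistency.mean())
```

In a fine-tuning setup of the kind the paper evaluates, a term like this would be added to the task loss, e.g. `total = task_loss + lam * mono_loss(acts, emb)`, with the weight `lam` a hyperparameter (an assumption; the paper's exact integration may differ).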
Practical Impact
MonoLoss lets pre-trained vision models be fine-tuned to extract more interpretable features without sacrificing task performance. The paper reports that fine-tuning ResNet-50 and CLIP-ViT-B/32 with MonoLoss as an auxiliary regularizer yields accuracy gains of up to 0.6% on ImageNet-1K, along with improvements on CIFAR-10 and CIFAR-100, while adding only about 4% per-epoch computational overhead.
Analogy / Intuitive Explanation
Imagine you're trying to understand what a particular image is depicting. A polysemantic feature responds to multiple concepts, such as a cat, a tree, and a car, all at the same time, making it difficult to tell what the feature actually represents. A monosemantic feature, by contrast, responds to a single concept, such as a cat. MonoLoss acts as a training signal that pushes the model's features toward this monosemantic behavior, making it easier to understand what the model is responding to.
Paper Information
Categories: cs.CV
arXiv ID: 2602.12403v1