Predicting Where Steering Vectors Succeed

Explainable & Ethical AI

Published: arXiv: 2604.15557v1

Authors

Jayadev Billa

Abstract

Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, $A_{\mathrm{lin}}$, applies the model's unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak $A_{\mathrm{lin}}$ predicts steering effectiveness at $\rho = +0.86$ to $+0.91$ and layer selection at $\rho = +0.63$ to $+0.92$. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model.

Paper Summary

Problem

Steering vectors modify the behavior of a language model by adding a direction to the residual stream at a chosen layer. They work for some concepts and layers but fail for others, and practitioners have no way to predict which settings will work before running an intervention. The result is trial and error across concepts and layers, which wastes time and compute.
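To make the mechanics concrete, here is a minimal sketch of difference-of-means steering, the method named in the abstract. It uses toy NumPy arrays in place of real residual-stream activations. The function names (`diff_of_means_vector`, `steer`), the scale `alpha`, and the synthetic data are illustrative assumptions, not the paper's code.

```python
import numpy as np

def diff_of_means_vector(pos_hidden, neg_hidden):
    """Steering direction: mean positive-class state minus mean negative-class state."""
    pos = np.asarray(pos_hidden, dtype=float)
    neg = np.asarray(neg_hidden, dtype=float)
    return pos.mean(axis=0) - neg.mean(axis=0)

def steer(hidden, direction, alpha=1.0):
    """Add a scaled direction to every hidden-state vector (rows of `hidden`)."""
    return np.asarray(hidden, dtype=float) + alpha * direction

# Toy stand-in for activations at one layer (assumption: the concept lives
# along the first coordinate; real activations come from a model forward pass).
rng = np.random.default_rng(0)
d_model = 8
concept_axis = np.zeros(d_model)
concept_axis[0] = 1.0
pos = rng.normal(size=(32, d_model)) + 2.0 * concept_axis
neg = rng.normal(size=(32, d_model)) - 2.0 * concept_axis

v = diff_of_means_vector(pos, neg)
steered = steer(neg, v, alpha=1.0)  # push negative-class states toward the positive class
```

With `alpha=1.0`, the mean of the steered negative-class states lands on the positive-class mean. In a real model the same addition would be applied to the residual stream at one layer during the forward pass, and whether it changes the output is exactly what is hard to predict in advance.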

Key Innovation

The paper introduces the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens to predict where steering vectors will succeed. Its key measure, $A_{\mathrm{lin}}$, applies the model's unembedding matrix to intermediate hidden states and requires no training. The LAP measures how accessible a concept is at each layer and uses that profile to indicate which layer to steer at. Across 24 controlled binary concept families on five models, peak $A_{\mathrm{lin}}$ predicts steering effectiveness at $\rho = +0.86$ to $+0.91$ and layer selection at $\rho = +0.63$ to $+0.92$, giving practitioners a systematic alternative to trial and error.
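The logit-lens idea behind the LAP can be sketched as follows. This summary does not give the exact definition of the paper's measure (called $A_{\mathrm{lin}}$ in the abstract), so the score below is an assumed proxy: the fraction of examples, per layer, for which the unembedded hidden state favors the correct one of two concept tokens. The name `logit_lens_profile`, the toy data, and the omission of the final layer norm (which logit-lens implementations usually apply before unembedding) are all simplifications for illustration.

```python
import numpy as np

def logit_lens_profile(hidden_by_layer, W_U, pos_token, neg_token, labels):
    """Per-layer readout accuracy of a binary concept via the unembedding matrix.

    hidden_by_layer: (n_layers, n_examples, d_model) intermediate hidden states
    W_U:             (d_model, vocab) unembedding matrix
    labels:          (n_examples,) 1 if pos_token is correct, 0 if neg_token is
    """
    H = np.asarray(hidden_by_layer, dtype=float)
    logits = H @ W_U                                   # (n_layers, n_examples, vocab)
    margin = logits[..., pos_token] - logits[..., neg_token]
    predicted = (margin > 0).astype(int)
    return (predicted == np.asarray(labels)).mean(axis=1)

# Toy data (assumption): the concept signal along coordinate 0 grows with depth.
rng = np.random.default_rng(0)
n_layers, n_examples, d_model, vocab = 4, 64, 4, 6
pos_token, neg_token = 1, 2
labels = np.arange(n_examples) % 2
signs = 2 * labels - 1
strength = np.array([0.0, 0.5, 2.0, 3.0])

hidden = rng.uniform(-1.0, 1.0, size=(n_layers, n_examples, d_model))
hidden[:, :, 0] += strength[:, None] * signs[None, :]

# Toy unembedding: coordinate 0 raises pos_token's logit and lowers neg_token's.
W_U = np.zeros((d_model, vocab))
W_U[0, pos_token] = 1.0
W_U[0, neg_token] = -1.0

profile = logit_lens_profile(hidden, W_U, pos_token, neg_token, labels)
best_layer = int(np.argmax(profile))  # first layer where the concept is most readable
```

No probe is trained here: the only learned object is the model's own unembedding matrix, which is what makes this kind of diagnostic cheap to run. Picking the layer with the highest score mirrors the abstract's use of the peak value for layer selection, under the assumed proxy.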

Practical Impact

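The LAP has two main practical uses. It can indicate which layer to steer at to achieve a specific effect, such as improving the truthfulness of a model, and it can identify which concepts are steerable at all, so practitioners can focus on the most promising ones. The paper's three-regime framework separates cases where difference-of-means steering works, where nonlinear methods are needed, and where no method can work. In an entity-steering demo on Gemma-2-2B and OLMo-2-1B-Instruct, steering at the LAP-recommended layer redirected completions, while steering at the middle layer (the standard heuristic) had no effect on either model. The LAP can be used in single-token next-token completion tasks and in multi-token settings, which suggests it could extend to applications such as language translation and text summarization, although the abstract reports no results on those tasks.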

Analogy / Intuitive Explanation

Imagine you're trying to steer a car to a specific destination. You need to know which road to take and when to adjust your course. The LAP is like a GPS that tells you which road to take before you set off, so you don't have to find out by driving every road. Just as a GPS uses maps and algorithms to provide turn-by-turn directions, the LAP uses the logit lens to predict whether a steering intervention will succeed.

Paper Information

Categories:

cs.LG cs.CL

arXiv ID:

2604.15557v1
