Singular Value Few-shot Adaptation of Vision-Language Models

Computer Vision & MultiModal AI
Published on arXiv: 2509.03740v1
Authors

Taha Koleilat, Hassan Rivaz, Yiming Xiao

Abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
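In standard SVD notation (a generic sketch; the paper's exact parameterization may differ in detail), the adaptation described in the abstract amounts to factoring each pretrained weight matrix once, freezing its singular bases, and updating only the singular values:

```latex
% Pretrained weight matrix, factored once by SVD
W = U \,\mathrm{diag}(\sigma_1, \ldots, \sigma_r)\, V^{\top}
% CLIP-SVD keeps U and V frozen and fine-tunes only the singular values,
% so the adapted weight rescales the same basis vectors:
W_{\text{adapted}} = U \,\mathrm{diag}(\tilde{\sigma}_1, \ldots, \tilde{\sigma}_r)\, V^{\top}
```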

Paper Summary

Problem
The paper addresses how to adapt vision-language models (VLMs) like CLIP to new fine-grained domains with minimal computational overhead and without compromising their generalization ability. Existing routes are unsatisfying: prompt engineering is brittle, full fine-tuning is expensive, and add-on components such as prompt tokens or adapter modules can destabilize the model and erode the knowledge learned during pretraining.
Key Innovation
The key innovation is CLIP-SVD, a multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Only the singular values of the pretrained weight matrices are fine-tuned, which rescales their basis vectors for the target domain while updating just 0.04% of the model's total parameters and better preserving its generalization ability.
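To make this concrete, here is a minimal, hypothetical PyTorch sketch of the mechanism (not the authors' released implementation; the wrapper name and layer choice are illustrative): a pretrained linear weight is factored with SVD, the singular bases are frozen, and only the singular values are trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SVDTunedLinear(nn.Module):
    """Illustrative wrapper: freeze the singular bases of a pretrained
    linear weight and train only its singular values."""

    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        W = pretrained.weight.data  # shape: (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        # Singular bases stay frozen (stored as buffers, not parameters).
        self.register_buffer("U", U)
        self.register_buffer("Vh", Vh)
        # Only the singular values are trainable; they rescale the bases.
        self.singular_values = nn.Parameter(S.clone())
        if pretrained.bias is not None:
            self.register_buffer("bias", pretrained.bias.data.clone())
        else:
            self.bias = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct the adapted weight: U diag(sigma) V^T.
        W_adapted = self.U @ torch.diag(self.singular_values) @ self.Vh
        return F.linear(x, W_adapted, self.bias)


# Example on a single projection layer: a 512x512 weight exposes
# only 512 trainable parameters (~0.2% of that layer).
layer = nn.Linear(512, 512)
adapted = SVDTunedLinear(layer)
print(sum(p.numel() for p in adapted.parameters() if p.requires_grad))  # -> 512
```

Applying this kind of reparameterization across the parameter matrices of both the vision and text encoders is what keeps the trainable fraction of the full model so small.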
Practical Impact
The practical impact is that VLMs like CLIP can be adapted to new domains with minimal computational overhead, which matters for real-world applications where labeled data and compute are scarce. The state-of-the-art classification results achieved by CLIP-SVD on 11 natural and 10 biomedical datasets demonstrate its effectiveness in both accuracy and generalization under few-shot settings.
Analogy / Intuitive Explanation
Imagine adapting a camera to photograph a type of flower it has never seen before. The camera (CLIP) was built to recognize many kinds of flowers, but it needs adjustment to capture this flower's distinctive features. CLIP-SVD acts like a special lens that tweaks a small set of the camera's settings (the singular values) to bring those features into focus without altering the camera's overall structure. The result is high-quality pictures of the new flower with minimal adjustments, just as CLIP-SVD adapts CLIP to a new domain by changing only a tiny fraction of its parameters.
Paper Information
Categories: cs.CV, cs.CL
arXiv ID: 2509.03740v1
