Dataset Distillation for Pre-Trained Self-Supervised Vision Models

Published: arXiv:2511.16674v1
Authors

George Cazenavette, Antonio Torralba, Vincent Sitzmann

Abstract

The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.

Paper Summary

Problem
This paper addresses dataset distillation: creating a small set of synthetic images such that a model trained on them matches the performance of the same model trained on a much larger dataset of real images. However, most existing distillation methods target models trained from scratch, whereas modern computer vision pipelines increasingly build on large, pre-trained self-supervised models. The authors therefore develop a method for distilling datasets that optimally train linear probes on top of such pre-trained models.
Key Innovation
The key innovation of this research is a new method called Linear Gradient Matching. This method optimizes synthetic images to induce gradients in a linear classifier similar to those produced by real data when passed through a pre-trained feature extractor. The authors claim that their method outperforms all real-image baselines and generalizes across pre-trained vision models.
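The gradient-matching idea above can be illustrated with a minimal, self-contained sketch. Everything here is a toy stand-in, not the paper's implementation: the "frozen backbone" is a random linear map with a ReLU, the data are random vectors, and the synthetic images are updated by finite-difference descent on the cosine distance between the linear probe's gradients on real versus synthetic data (a real implementation would backpropagate through the pre-trained backbone with autodiff).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, far smaller than a real ViT feature space).
D_IN, D_FEAT, C = 8, 6, 3

# Stand-in for a frozen pre-trained backbone: a fixed random linear map + ReLU.
W_f = rng.normal(size=(D_FEAT, D_IN))

def features(x):
    """Frozen 'feature extractor': x (n, D_IN) -> features (n, D_FEAT)."""
    return np.maximum(x @ W_f.T, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def probe_grad(x, y_onehot, theta):
    """Gradient of softmax cross-entropy wrt linear-probe weights theta (C, D_FEAT)."""
    f = features(x)
    p = softmax(f @ theta.T)
    return (p - y_onehot).T @ f / len(x)

def cosine_dist(g1, g2):
    a, b = g1.ravel(), g2.ravel()
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# "Real" data: random points with random labels (placeholder for real images).
x_real = rng.normal(size=(30, D_IN))
y_real = np.eye(C)[rng.integers(0, C, 30)]

# One synthetic image per class, initialized from noise.
x_syn = rng.normal(size=(C, D_IN))
y_syn = np.eye(C)

# A randomly initialized linear probe (one random draw suffices for this toy).
theta = 0.1 * rng.normal(size=(C, D_FEAT))

def match_loss(x_s):
    return cosine_dist(probe_grad(x_s, y_syn, theta),
                       probe_grad(x_real, y_real, theta))

# Optimize the synthetic "images" to minimize the gradient-matching distance.
lr, eps = 0.1, 1e-5
before = match_loss(x_syn)
for _ in range(200):
    base = match_loss(x_syn)
    grad = np.zeros_like(x_syn)
    for idx in np.ndindex(*x_syn.shape):
        x_pert = x_syn.copy()
        x_pert[idx] += eps
        grad[idx] = (match_loss(x_pert) - base) / eps
    x_syn -= lr * grad
after = match_loss(x_syn)
print(f"gradient-matching distance: {before:.3f} -> {after:.3f}")
```

The optimization drives the synthetic set's induced probe gradient toward the real data's, so training a probe on the few synthetic points mimics a training step on the full real set; the matching distance should shrink over the run.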
Practical Impact
This research has several practical implications. First, it enables training linear classifiers on top of pre-trained self-supervised models from a tiny set of synthetic images, and the distilled images themselves serve as a tool for model interpretability, revealing how a pre-trained model "sees" the world. Second, the distilled datasets are exceptionally effective for fine-grained classification, where the method outperforms real-image baselines by a large margin. Finally, the distilled data can be used to predict how well two models' embedding spaces align (in the sense of the platonic representation hypothesis) and to probe whether a model is sensitive to spurious correlations, shedding light on its ability to generalize beyond its training distribution.
Analogy / Intuitive Explanation
Imagine you're trying to teach a child to recognize different animals. You could show them a large collection of animal photographs, or you could create a few carefully chosen images that capture each animal's essential features (e.g., a cat drawn with prominent whiskers and pointed ears). The latter approach is analogous to dataset distillation: a small set of synthetic images stands in for the full collection. In this research, the authors craft these synthetic images so that a model trained on them matches the performance it would achieve if trained on the much larger set of real images.
Paper Information
Categories: cs.CV cs.AI cs.LG
arXiv ID: 2511.16674v1
