Fast Data Attribution for Text-to-Image Models

Published: arXiv: 2511.10721v1
Authors

Sheng-Yu Wang, Aaron Hertzmann, Alexei A. Efros, Richard Zhang, Jun-Yan Zhu

Abstract

Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500x - 400,000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.

Paper Summary

Problem
Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods require considerable computation for every query, which makes them too slow and costly for real-world use.
Key Innovation
The paper proposes a scalable and efficient approach to data attribution. The key idea is to distill a slow, unlearning-based attribution method into a feature embedding space, so that highly influential training images can be found by retrieval rather than by rerunning the attribution algorithm. At deployment time, the embeddings are combined with efficient indexing and search, which significantly reduces the runtime and storage cost of attribution.
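The retrieval step described above can be illustrated with a minimal sketch. It assumes that the distilled embeddings already exist and that influence is approximated by cosine similarity between the query embedding and each training embedding; the function names, the random stand-in embeddings, and the brute-force search are all hypothetical and are not taken from the paper, which would rely on a learned feature extractor and a proper search index at scale.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_influential(query_emb, train_embs, k=5):
    """Return indices and scores of the k training embeddings most similar
    to the query, using cosine similarity as a stand-in influence score.

    Assumption: similarity in the (hypothetical) distilled embedding space
    approximates the ranking produced by the slow attribution method.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(train_embs, dtype=np.float64))
    scores = t @ q  # one cosine similarity per training image
    k = max(1, min(k, scores.shape[0]))
    # Select the k best candidates, then sort only those k by score.
    idx = np.argpartition(-scores, k - 1)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]


# Random vectors stand in for precomputed training-image embeddings.
rng = np.random.default_rng(0)
train_embs = rng.normal(size=(1000, 64))
# A query that is a slightly perturbed copy of training item 42.
query_emb = train_embs[42] + 0.01 * rng.normal(size=64)

indices, scores = top_k_influential(query_emb, train_embs, k=5)
```

In this sketch the expensive work (computing embeddings for the training set) happens once, offline; each query then costs a single matrix-vector product. For collections the size of LAION, the brute-force product would presumably be replaced by an approximate nearest-neighbor index.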
Practical Impact
Fast attribution makes real-world applications feasible, such as compensation models for the creators of training data, which could help address open questions about the authorship of generative content. The method can also be applied to other widely used models, making their outputs easier to explain to end users. In addition, identifying highly influential training images can help in understanding model behavior and improving model performance.
Analogy / Intuitive Explanation
Imagine you're trying to understand how a complex machine works. Data attribution is like identifying the specific parts of the machine that are most responsible for its behavior. Existing methods are like taking apart the entire machine every time you ask a question, which is time-consuming and impractical. The new approach is like doing that careful analysis once and recording the results in a compact reference guide, so that later questions can be answered by a quick lookup.
Paper Information
Categories: cs.CV, cs.LG
arXiv ID: 2511.10721v1
