Linear Scaling Video VLMs for Long Video Understanding

Published: arXiv: 2605.31598v1
Authors

Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles

Abstract

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost, as measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.

Paper Summary

Problem
The paper addresses the computational bottleneck of processing long videos with video vision-language models (VLMs). Because most video encoders rely on spatiotemporal self-attention, compute and latency grow quadratically with the number of frames, which makes these models hard to deploy in long-horizon and real-time streaming settings.
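The scaling gap can be made concrete with a rough cost count. This is a back-of-the-envelope sketch, not a formula taken from the paper: the symbols T (number of frames), n (visual tokens per frame), d (hidden width), and K (capacity of a bounded cross-frame context) are assumptions introduced here, and constants, projections, and MLP cost are ignored. The first line counts attention work when every frame attends to all frames so far; the second counts it when each frame attends to at most K carried-over tokens plus its own n tokens.

```latex
\begin{aligned}
C_{\text{full}} &\approx \sum_{t=1}^{T} n \cdot (t\,n) \cdot d
  = \frac{n^{2} d \, T (T+1)}{2}
  = O\!\left(n^{2} d \, T^{2}\right), \\
C_{\text{bounded}} &\le \sum_{t=1}^{T} n \cdot (K + n) \cdot d
  = n \,(K + n)\, d \, T
  = O(T) \quad \text{for fixed } n,\ K,\ d.
\end{aligned}
```

Under these assumptions the ratio of the two costs grows roughly in proportion to T, so the saving matters most for the longest videos.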
Key Innovation
The key innovation is StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill without fine-tuning or architectural changes. StateKV carries cross-frame context in a fixed-capacity, importance-based recurrent state and keeps a second, full per-frame cache for decoding. Because each frame attends to a bounded state rather than to every earlier frame, the cost of video prefill drops from quadratic to linear in the number of processed frames.
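The two-cache idea described above can be illustrated with a toy, single-head NumPy sketch. It is not the authors' implementation: the function name `streaming_prefill`, the importance score (accumulated attention mass a token receives), and the non-causal attention within a frame are all assumptions made for illustration, since the summary does not specify how importance is computed.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def streaming_prefill(frames_q, frames_k, frames_v, capacity):
    """Toy prefill: each frame attends to a bounded state plus its own tokens.

    frames_q, frames_k, frames_v: lists of (n_tokens, d) arrays, one per frame.
    capacity: maximum number of tokens carried across frames (hypothetical name).
    Returns per-frame outputs, the bounded state, and the full per-frame cache.
    """
    d = frames_k[0].shape[1]
    state_k = np.empty((0, d))
    state_v = np.empty((0, frames_v[0].shape[1]))
    state_score = np.empty((0,))  # assumed importance: attention mass received
    full_cache = []               # every frame's (K, V), kept for decoding
    outputs = []

    for q, k, v in zip(frames_q, frames_k, frames_v):
        # Context is bounded: at most `capacity` carried tokens plus this frame.
        ctx_k = np.concatenate([state_k, k])
        ctx_v = np.concatenate([state_v, v])
        attn = softmax(q @ ctx_k.T / np.sqrt(d))  # (n_tokens, len(ctx_k))
        outputs.append(attn @ ctx_v)
        full_cache.append((k, v))

        # Accumulate importance, then keep the top `capacity` tokens.
        score = np.concatenate([state_score, np.zeros(len(k))]) + attn.sum(axis=0)
        keep = np.sort(np.argsort(-score)[:capacity])  # sorted to preserve order
        state_k, state_v, state_score = ctx_k[keep], ctx_v[keep], score[keep]

    return outputs, (state_k, state_v), full_cache


# Small demo with random tensors (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, n, d, capacity = 6, 4, 8, 5
qs = [rng.normal(size=(n, d)) for _ in range(T)]
ks = [rng.normal(size=(n, d)) for _ in range(T)]
vs = [rng.normal(size=(n, d)) for _ in range(T)]
outputs, (state_k, state_v), full_cache = streaming_prefill(qs, ks, vs, capacity)
```

In this sketch the per-frame attention matrix never exceeds `n × (capacity + n)` entries, so prefill work grows linearly with the number of frames, while `full_cache` still holds every frame's keys and values for a later decoding step.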
Practical Impact
StateKV targets real-time streaming applications such as autonomous driving and embodied robotics, where models must integrate evidence over minutes or hours. Because it lowers the cost of video prefill (measured in FLOPs), a larger model can be run within the same compute budget, which the paper reports yields stronger accuracy on long-video understanding.
Analogy / Intuitive Explanation
Imagine following a long conversation by replaying everything said so far each time a new sentence arrives. The effort grows rapidly as the conversation gets longer. StateKV is instead like keeping a fixed-size set of notes on the most important points so far, which gives enough context to make sense of what comes next. The notes are updated as new information arrives, so the effort per new sentence stays roughly constant no matter how long the conversation runs.
Paper Information
Categories: cs.CV
arXiv ID: 2605.31598v1
