SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Agentic AI

Published: arXiv: 2605.31529v1

Authors

Yulu Pan, Han Yi, Seongsu Ha, Md Mohaiminul Islam, Benjamin Zhang, Lorenzo Torresani, Gedas Bertasius

Abstract

True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10-22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.

Paper Summary

Problem

Current AI systems struggle with Strategic Video Intelligence (SVI): reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. This capability is essential for real-world applications involving surgical teams, first responders, autonomous vehicles, and military units. SVI sits at the intersection of three active research frontiers (reasoning VLMs, video world models, and agentic intelligence), yet current systems fall short across its progressive cognitive stack of perception, causal reasoning, simulation, and agentic synthesis.

Key Innovation

The researchers introduce SVI-Bench, a large-scale benchmark for Strategic Video Intelligence in real-world multi-agent video environments. SVI-Bench comprises ∼35K hours of broadcast video, ∼15M annotated actions, ∼15K hours of expert commentary, ∼23K game reports, and ∼103K structured statistical records across basketball, soccer, and hockey. The benchmark is organized into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. The researchers also propose a data engine that integrates these five modalities through temporal alignment, cross-modal entity resolution, LLM-assisted instance generation, and multi-stage quality control.
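To make the temporal alignment step concrete, the following is a minimal sketch of one way annotated actions could be paired with commentary segments by timestamp. It is not the authors' data engine: the `Action` and `CommentarySegment` types, the `align_actions` function, and the 2-second `tolerance` are all hypothetical, and the sketch assumes both streams share one clock and that commentary segments do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    t: float      # seconds from the start of the broadcast (assumed shared clock)
    label: str


@dataclass(frozen=True)
class CommentarySegment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str


def align_actions(
    actions: List[Action],
    segments: List[CommentarySegment],
    tolerance: float = 2.0,  # hypothetical slack, in seconds
) -> List[Tuple[Action, Optional[CommentarySegment]]]:
    """Pair each action with the segment covering its timestamp, or with the
    nearest neighbouring segment within `tolerance` seconds; otherwise None."""
    segments = sorted(segments, key=lambda s: s.start)
    starts = [s.start for s in segments]
    aligned = []
    for action in actions:
        # Index of the last segment starting at or before the action time.
        i = bisect_right(starts, action.t) - 1
        best, best_gap = None, tolerance
        # Only the segment before and the segment after can be closest,
        # given the non-overlap assumption.
        for j in (i, i + 1):
            if not 0 <= j < len(segments):
                continue
            seg = segments[j]
            if seg.start <= action.t <= seg.end:
                gap = 0.0
            else:
                gap = min(abs(action.t - seg.start), abs(action.t - seg.end))
            if gap <= best_gap and (best is None or gap < best_gap):
                best, best_gap = seg, gap
        aligned.append((action, best))
    return aligned
```

Sorting once and using binary search keeps each lookup logarithmic in the number of segments, which matters at the scale of millions of actions. A real pipeline would also need to handle clock drift between sources and overlapping commentary, which this sketch ignores.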

Practical Impact

SVI-Bench could accelerate progress toward AI systems capable of strategic intelligence in complex, dynamic multi-agent environments: systems that can reason about why events unfold, simulate what-if alternatives, and decide what to do next. The benchmark can be used to evaluate and improve video models with stronger reasoning capabilities, generative video models with explicit notions of multi-agent dynamics, and multimodal agents that plan, retrieve, and reason across video and document corpora.

Analogy / Intuitive Explanation

Imagine watching a basketball game and trying to understand what is happening on the court. A current AI system might recognize the players and their actions, but it would struggle to explain why the defense is collapsing or to predict what would happen if a player drove left instead of right. SVI-Bench tests for the kind of analysis a coach performs: reasoning about the events unfolding on the court, simulating different scenarios, and deciding what to do next.

Paper Information

Categories:

cs.CV

arXiv ID:

2605.31529v1
