WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception

Computer Vision & MultiModal AI
Published: arXiv: 2508.15720v1
Authors

Zhiheng Liu, Xueqing Deng, Shoufa Chen, Angtian Wang, Qiushan Guo, Mingfei Han, Zeyue Xue, Mengzhao Chen, Ping Luo, Linjie Yang

Abstract

Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.

Paper Summary

Problem
Video generators still struggle to keep structure and motion consistent over long sequences. Most current methods rely predominantly on RGB signals, so errors in object structure and motion accumulate as the generated video gets longer.
Key Innovation
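The paper introduces WorldWeaver, a framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. It reports three advantages. First, predicting perceptual conditions and color from one shared representation improves temporal consistency and motion dynamics. Second, a memory bank built on depth cues, which the authors observe to drift less than RGB, preserves clearer context over long horizons. Third, segmented noise scheduling for training prediction groups further mitigates drift and reduces computational cost.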
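The summary does not say how segmented noise scheduling is implemented, so the following is only a minimal sketch of one plausible reading: frames are split into consecutive groups, each group shares a single noise level, and later groups never receive less noise than earlier ones. The function name `segmented_noise_levels`, the `group_size` parameter, and the sorted uniform sampling are all assumptions made for illustration, not details taken from the paper.

```python
import random

def segmented_noise_levels(num_frames, group_size, rng):
    """Assign one noise level in [0, 1) per frame, shared within each group.

    Hypothetical reading of "segmented noise scheduling": frames are split
    into consecutive groups, and later groups get equal or higher noise.
    The paper's actual schedule may differ.
    """
    if num_frames <= 0 or group_size <= 0:
        raise ValueError("num_frames and group_size must be positive")
    # Ceiling division: the last group may be shorter than group_size.
    num_groups = (num_frames + group_size - 1) // group_size
    # Draw one level per group, then sort so noise never decreases over time.
    group_levels = sorted(rng.random() for _ in range(num_groups))
    # Broadcast each group's level to every frame in that group.
    return [group_levels[i // group_size] for i in range(num_frames)]

levels = segmented_noise_levels(num_frames=12, group_size=4, rng=random.Random(0))
for index, level in enumerate(levels):
    print(f"frame {index:2d}: noise level {level:.3f}")
```

Under this reading, the model sees one noise level per group rather than one per frame, which is one way such a schedule could cut the number of distinct denoising conditions it has to handle.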
Practical Impact
WorldWeaver could be applied wherever long generated video must stay coherent, such as video editing, special effects, and robotics, and potentially virtual reality, gaming, and surveillance. These are prospective uses: the abstract reports reduced temporal drift and improved fidelity of generated videos, not results on these downstream applications.
Analogy / Intuitive Explanation
Imagine trying to predict the trajectory of a thrown ball. If you look only at the ball's color and texture, your prediction may hold for a short time, but as the ball travels further and faster, small errors accumulate until the predicted path no longer matches the real one. WorldWeaver is like a predictor that also tracks the ball's depth and motion, cues that drift less than appearance alone, so its predictions stay accurate over longer periods.
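The accumulation described in this analogy can be made concrete with a toy rollout. The sketch below is not part of WorldWeaver: it assumes a ball moving at constant velocity and a predictor whose velocity estimate is slightly biased, with each prediction built on the previous one. The names `simulate_drift`, `v_true`, and `v_bias` are made up for the example.

```python
def simulate_drift(steps, dt=0.1, v_true=5.0, v_bias=0.05):
    """Return the absolute position error after each step of a rollout.

    Each step advances the predicted position from the previous prediction
    using a slightly biased velocity, so the error is never corrected.
    """
    true_pos, pred_pos = 0.0, 0.0
    errors = []
    for _ in range(steps):
        true_pos += v_true * dt              # ground-truth motion
        pred_pos += (v_true + v_bias) * dt   # biased autoregressive update
        errors.append(abs(pred_pos - true_pos))
    return errors

errors = simulate_drift(steps=100)
print(f"error after 10 steps:  {errors[9]:.3f}")
print(f"error after 100 steps: {errors[99]:.3f}")
```

In this toy setting the error grows linearly with the number of steps, so a bias that is negligible over 10 steps is ten times larger after 100. That is the kind of long-horizon drift the analogy points to.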
Paper Information
Categories: cs.CV
arXiv ID: 2508.15720v1
