Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Generative AI & LLMs
arXiv: 2511.17450v1
Authors

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal

Abstract

Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ either single-shot plans, which are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
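To make the video-sketch idea concrete, below is a minimal illustrative implementation of compositing an object crop over a static background along a planned trajectory. The function name render_video_sketch and its NumPy-based signature are assumptions made for this sketch, not code from the paper.

```python
import numpy as np

def render_video_sketch(background, object_patch, mask, trajectory):
    """Composite a moving object over a static background (illustrative).

    background:   (H, W, 3) uint8 static scene image.
    object_patch: (h, w, 3) uint8 crop of the moving object.
    mask:         (h, w) bool alpha mask selecting object pixels in the crop.
    trajectory:   list of (x, y) top-left positions, one per frame.
    Returns a list of frames approximating the planned motion.
    """
    h, w = object_patch.shape[:2]
    frames = []
    for x, y in trajectory:
        frame = background.copy()
        # Clip the paste region so the object may partially leave the frame.
        y0, y1 = max(0, y), min(background.shape[0], y + h)
        x0, x1 = max(0, x), min(background.shape[1], x + w)
        patch = object_patch[y0 - y:y1 - y, x0 - x:x1 - x]
        m = mask[y0 - y:y1 - y, x0 - x:x1 - x]
        # Paste only the masked object pixels onto the background view.
        frame[y0:y1, x0:x1][m] = patch[m]
        frames.append(frame)
    return frames
```

Because each frame is a simple copy-and-paste composite, scoring a candidate plan this way never touches the diffusion generator, which is where the efficiency gain over full-generation-based refinement comes from.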

Paper Summary

Problem
The main problem addressed in this research paper is the challenge of generating videos with physically realistic and temporally consistent motion. Current video generation models can produce impressive visual quality and semantic alignment, but they often struggle to interpret fine-grained motion instructions and generate sequences that adhere to plausible physical dynamics.
Key Innovation
The key innovation in this work is a training-free, sketch-verification-based planning framework called SketchVerify. This framework improves motion planning quality by introducing a test-time sampling and verification loop that iteratively refines motion plans using verification on lightweight video sketches. This approach decouples refinement from the diffusion backbone and verifies motion at the sketch or layout level, avoiding the heavy overhead of full-generation-based iterative updates.
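As a rough illustration of that loop, the sketch below samples candidate motion plans, scores each one on a cheap composited preview, and keeps the best-scoring plan. The callables sample_plans, render_sketch, and score_sketch are hypothetical placeholders (injected as arguments so the function is self-contained); they stand in for the planner, the compositing step shown earlier, and the vision-language verifier, and are not the paper's actual API.

```python
def plan_with_verification(prompt, image, sample_plans, render_sketch,
                           score_sketch, num_candidates=8,
                           score_threshold=0.8, max_rounds=3):
    """Test-time sampling-and-verification loop (illustrative sketch)."""
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        # Sample several candidate object trajectories for the prompt.
        candidates = sample_plans(prompt, image, num_candidates)
        for plan in candidates:
            # Render a lightweight video sketch instead of calling the
            # expensive diffusion generator for this candidate.
            sketch = render_sketch(image, plan)
            # The verifier jointly scores semantic alignment with the
            # instruction and physical plausibility of the motion.
            score = score_sketch(sketch, prompt)
            if score > best_score:
                best_plan, best_score = plan, score
        if best_score >= score_threshold:
            break  # A satisfactory plan was found; stop refining.
    # The winning plan is then passed to the trajectory-conditioned
    # generator for final synthesis.
    return best_plan
```

Decoupling the loop from the generator this way makes the per-candidate cost a composite-and-score call rather than a full diffusion run, which is why scaling up the number of trajectory candidates (as in the paper's ablation) remains cheap.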
Practical Impact
The practical impact of this research is significant, as it enables the efficient generation of videos with physically realistic and temporally consistent motion. This has applications in various fields, including robotic manipulation, autonomous driving, and game content creation. The framework can be used to generate videos that adhere to physical laws and instructions, which can be particularly useful in scenarios where safety and accuracy are critical.
Analogy / Intuitive Explanation
Imagine you're trying to draw an animation of a cat moving from one side of the room to the other. A traditional video generation model might try to produce the entire animation in one go, which can result in confusing and unrealistic motion. SketchVerify, on the other hand, breaks the process into smaller steps: it first sketches the cat's position and movement, then checks and refines the sketch until it obeys the physical laws of motion, and only then commits to the full drawing. This approach allows for more accurate and realistic video generation and can be applied to a wide range of scenarios where motion is involved.
Paper Information
Categories: cs.CV cs.AI cs.CL
arXiv ID: 2511.17450v1
