Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory

Published: arXiv:2601.16296v1
Authors

Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, Hyeonho Jeong

Abstract

Recent foundational video-to-video diffusion models have achieved impressive results in editing user-provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple, yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: https://dohunlee1.github.io/MemoryV2V
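
To make the pipeline concrete, here is a minimal Python sketch of the multi-turn editing loop the abstract describes: an external cache stores earlier edits, a retrieval step recalls them, and the current editing step is conditioned on the retrieved results. Every name below (MemoryCache, retrieve, tokenize_memory, edit) is an illustrative placeholder rather than the paper's actual API, and the retrieval and tokenization logic are stand-ins for the strategies the paper proposes.

# Hypothetical sketch of multi-turn editing with an external memory cache.
# The model methods tokenize_memory and edit are assumed interfaces, not real APIs.
from dataclasses import dataclass, field


@dataclass
class MemoryCache:
    """External cache of previously edited videos (e.g., tensors or file paths)."""
    entries: list = field(default_factory=list)

    def add(self, edited_video):
        self.entries.append(edited_video)

    def retrieve(self, query, top_k=1):
        # Placeholder retrieval: return the most recent edits. The paper describes
        # an accurate retrieval strategy; this stand-in just keeps the loop runnable.
        return self.entries[-top_k:] if self.entries else []


def multi_turn_edit(model, source_video, edit_instructions):
    """Run sequential edits, conditioning each turn on retrieved prior results."""
    memory = MemoryCache()
    results = []
    for instruction in edit_instructions:
        prior_edits = memory.retrieve(source_video)        # recall earlier turns
        cond_tokens = model.tokenize_memory(prior_edits)   # dynamic tokenization (assumed)
        edited = model.edit(source_video, instruction,
                            memory_tokens=cond_tokens)     # condition the current edit
        memory.add(edited)                                 # store result for future turns
        results.append(edited)
    return results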

Paper Summary

Problem
This paper addresses the problem of maintaining cross-consistency in multi-turn video editing. Because current video editors handle each editing round independently, they struggle to stay consistent across sequential edits, so later results drift away from earlier ones.
Key Innovation
The key innovation is Memory-V2V, a framework that augments existing video-to-video diffusion models with explicit visual memory. Previously edited videos are kept in an external cache; at each editing step, relevant prior results are retrieved, dynamically tokenized, and used to condition the current edit, and a learnable token compressor inside the DiT backbone compresses redundant conditioning tokens to keep overhead small. This lets the model recall previous edits and remain consistent across multiple rounds of interaction.
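The token compressor is described only at a high level in this summary, so the sketch below shows one plausible realization rather than the paper's actual design: a small set of learnable query tokens cross-attends to the longer sequence of memory conditioning tokens, and only the compressed set is passed on to the DiT backbone. All dimensions and module choices are assumptions made for illustration.

# Hypothetical PyTorch sketch of a learnable token compressor. The exact design used
# in Memory-V2V is not specified in this summary; this is a generic learnable-query
# cross-attention compressor that illustrates the idea.
import torch
import torch.nn as nn


class TokenCompressor(nn.Module):
    def __init__(self, dim: int, num_compressed: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens that summarize the memory tokens.
        self.queries = nn.Parameter(torch.randn(num_compressed, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory_tokens: torch.Tensor) -> torch.Tensor:
        # memory_tokens: (batch, num_memory_tokens, dim), e.g. tokens from prior edits.
        batch = memory_tokens.shape[0]
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.attn(queries, memory_tokens, memory_tokens)
        return self.norm(compressed)  # (batch, num_compressed, dim)


# Usage: compress 4096 conditioning tokens down to 64 before injecting them into the DiT.
compressor = TokenCompressor(dim=1024)
memory_tokens = torch.randn(2, 4096, 1024)
print(compressor(memory_tokens).shape)  # torch.Size([2, 64, 1024])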
Practical Impact
Memory-V2V makes iterative video editing more practical: users can refine results over multiple rounds of interaction while each new edit stays consistent with earlier ones, and the token compressor keeps the added computational cost low (the authors report an overall speedup of 30%). This has applications in domains such as entertainment and robotics simulation.
Analogy / Intuitive Explanation
Imagine you're editing a video and want to change the camera angle in multiple scenes. With traditional video editing tools, each scene is edited independently, so the results end up inconsistent with one another. Memory-V2V gives the editing model a memory of its previous edits, allowing it to make consistent changes across scenes, much as our own memories of past experiences inform the decisions we make in the present.
Paper Information
Categories: cs.CV, cs.AI, cs.LG
Published Date:
arXiv ID: 2601.16296v1