4DNeX: Feed-Forward 4D Generative Modeling Made Easy

Computer Vision & Multimodal AI
Published on arXiv: 2508.13154v1
Authors

Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu

Abstract

We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.

Paper Summary

Problem
The main challenge addressed in this paper is generating 4D (dynamic 3D) scene representations from a single image. Current methods require video input or rely on computationally intensive optimization procedures, making it difficult to create a scalable solution for image-to-4D modeling.
Key Innovation
The key innovation of this work is the development of 4DNeX, a feed-forward framework that fine-tunes a pretrained video diffusion model to enable efficient image-to-4D generation. This approach addresses the scarcity of 4D data by introducing a large-scale dataset with high-quality pseudo-4D annotations and proposes a set of simple yet effective adaptation strategies to repurpose video diffusion models for 4D modeling.
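The unified 6D video representation can be pictured as a video whose pixels carry six channels: three for RGB appearance and three for XYZ scene coordinates. The sketch below illustrates this packing in NumPy; the function name, tensor shapes, and min-max normalization are illustrative assumptions for exposition, not the paper's actual preprocessing pipeline.

```python
import numpy as np

def pack_6d_video(rgb_frames: np.ndarray, xyz_frames: np.ndarray) -> np.ndarray:
    """Stack per-pixel RGB appearance and XYZ geometry into one 6-channel video.

    rgb_frames: (T, H, W, 3) float array with values in [0, 1]
    xyz_frames: (T, H, W, 3) float array of per-pixel 3D coordinates
    Returns a (T, H, W, 6) array that a video model can treat as a joint target.
    """
    assert rgb_frames.shape == xyz_frames.shape, "RGB and XYZ sequences must align"
    # Normalize XYZ into a bounded range so both modalities share a comparable scale
    # (the normalization actually used by 4DNeX may differ).
    lo, hi = xyz_frames.min(), xyz_frames.max()
    xyz_norm = (xyz_frames - lo) / (hi - lo + 1e-8)
    # Concatenate along the channel axis: [R, G, B, X, Y, Z] per pixel per frame.
    return np.concatenate([rgb_frames, xyz_norm], axis=-1)

# Toy usage: 8 frames of 64x64 video with random geometry.
T, H, W = 8, 64, 64
rgb = np.random.rand(T, H, W, 3).astype(np.float32)
xyz = np.random.randn(T, H, W, 3).astype(np.float32)
video_6d = pack_6d_video(rgb, xyz)
print(video_6d.shape)  # (8, 64, 64, 6)
```

Representing appearance and geometry as one joint sequence is what lets a pretrained video diffusion model be fine-tuned to generate both at once, rather than reconstructing geometry in a separate optimization stage.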
Practical Impact
The practical impact of this research is the potential to create a scalable solution for image-to-4D modeling, enabling applications such as novel-view video synthesis, augmented reality (AR), and digital content creation. The proposed framework can also be used to simulate dynamic scene evolution, laying the foundation for generative 4D world models.
Analogy / Intuitive Explanation
Think of this research as trying to create a movie from a single still image: you would need to infer how the scene changes over time and what the 3D objects look like from different angles. The proposed framework uses machine learning to make educated guesses about these missing pieces, allowing it to generate dynamic 3D scenes from a single image.
Paper Information
Categories: cs.CV
Published Date: August 2025
arXiv ID: 2508.13154v1