MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow

Hanzhuo Huang, Yuan Liu, Ge Zheng, Jiepeng Wang, Zhiyang Dou, Sibei Yang

TL;DR

MVTokenFlow generates high-quality, temporally coherent 4D content from monocular videos in two stages. First, an Era3D-based multiview diffusion model produces spatially consistent multiview frames at each timestep, from which a coarse dynamic Gaussian field is reconstructed. Second, the multiview frames are regenerated under the guidance of 2D flows rendered from the coarse field, with token flow reusing tokens across frames to enforce temporal consistency; the regenerated frames then refine the 4D field, yielding sharper geometry and smoother motion. Quantitative and qualitative results on the Consistent4D dataset and self-collected clips show improvements over baselines in view synthesis accuracy, spatial fidelity, and temporal coherence, including novel-view consistency. The resulting 4D field can be rendered from arbitrary viewpoints and at arbitrary times.

Abstract

In this paper, we present MVTokenFlow for high-quality 4D content creation from monocular videos. Recent advances in generative models such as video diffusion models and multiview diffusion models enable us to create videos or 3D models. However, extending these generative models to dynamic 4D content creation remains challenging, because the generated content must be consistent both spatially and temporally. To address this challenge, MVTokenFlow uses a multiview diffusion model to generate multiview images at different timesteps, which attains spatial consistency across viewpoints and allows us to reconstruct a reasonable coarse 4D field. MVTokenFlow then regenerates all the multiview images using 2D flows rendered from the coarse field as guidance. The 2D flows associate pixels across timesteps and improve temporal consistency by reusing tokens in the regeneration process. Finally, the regenerated images, which are spatiotemporally consistent, are used to refine the coarse 4D field into a high-quality 4D field. Experiments demonstrate the effectiveness of our design and show significantly higher quality than baseline methods.
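
The flow-guided token reuse described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function name `propagate_tokens`, the nested-list token grid, and the nearest-neighbour lookup are all assumptions made for illustration. The actual method operates on tokens inside the multiview diffusion model and uses flows rendered from the coarse 4D field. The sketch only shows the core idea, that a 2D flow tells each location in a target frame which keyframe token to reuse.

```python
from typing import List, Tuple

Token = List[float]                          # one feature vector
TokenGrid = List[List[Token]]                # indexed as [row][col]
FlowField = List[List[Tuple[float, float]]]  # [row][col] -> (dx, dy) in grid cells


def propagate_tokens(key_tokens: TokenGrid, flow_to_key: FlowField) -> TokenGrid:
    """Fill a target frame's token grid by reusing keyframe tokens along a 2D flow.

    Hypothetical helper for illustration only.
    flow_to_key[y][x] = (dx, dy) means target cell (x, y) corresponds to
    keyframe cell (x + dx, y + dy). The lookup is nearest-neighbour and
    clamped to the grid borders.
    """
    height = len(key_tokens)
    width = len(key_tokens[0])
    target: TokenGrid = []
    for y in range(height):
        row: List[Token] = []
        for x in range(width):
            dx, dy = flow_to_key[y][x]
            # Round to the nearest keyframe cell and clamp to valid indices.
            src_x = min(max(int(round(x + dx)), 0), width - 1)
            src_y = min(max(int(round(y + dy)), 0), height - 1)
            # Copy so the target grid does not alias the keyframe tokens.
            row.append(list(key_tokens[src_y][src_x]))
        target.append(row)
    return target


# Usage: content moved one cell to the right, so every target cell
# looks one cell to the left in the keyframe.
key = [[[1.0], [2.0]], [[3.0], [4.0]]]
flow = [[(-1.0, 0.0), (-1.0, 0.0)], [(-1.0, 0.0), (-1.0, 0.0)]]
shifted = propagate_tokens(key, flow)
```

Because every target cell copies its token from a corresponding keyframe cell, appearance stays tied to the same surface point across timesteps, which is the intuition behind the temporal consistency gain. A fuller version would presumably interpolate between neighbouring tokens and handle occlusions, which this sketch ignores.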

Paper Structure

This paper contains 19 sections, 1 equation, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Given an input monocular video containing a dynamic foreground object (left), MVTokenFlow generates a 4D video represented by a dynamic 3D Gaussian field (right), using a multiview diffusion model and a token propagation method to improve both spatial and temporal consistency. On the right, we also show the colors of the Gaussian spheres and the rendered normal maps alongside the rendered RGB images.
  • Figure 2: Overview. Given an input video, which may itself be generated by a video diffusion model, we first apply Era3D [li2024era3d] to generate multiview-consistent images and normal maps for each timestep. Then, we reconstruct a coarse dynamic 3D Gaussian field from the generated multiview images. After that, we use the coarse dynamic 3D Gaussian field to render 2D flows that guide Era3D's regeneration of the multiview images, which greatly improves temporal consistency and image quality. Finally, the regenerated images are used to refine our dynamic 3D Gaussian field and improve its quality.
  • Figure 3: Qualitative comparison of the temporal consistency of our method with the baseline methods Consistent4D [jiang2023consistent4d], SC4D [wu2024sc4d], and STAG4D [zeng2024stag4d].
  • Figure 4: Qualitative comparison of the spatial consistency of our method with the baseline methods Consistent4D [jiang2023consistent4d], SC4D [wu2024sc4d], and STAG4D [zeng2024stag4d].
  • Figure 5: Ablation study of the overall architecture. The four parts show (a) the input viewpoint, (b) our final results, (c) the intermediate result from our coarse dynamic 3D field, and (d) the result without the flow loss.
  • ...and 8 more figures