Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

Hyeonho Jeong, Suhyeon Lee, Jong Chul Ye

TL;DR

Reangle-A-Video generates synchronized multi-view videos from a single input video by reframing 4D synthesis as video-to-video translation. It learns view-invariant motion through synchronized, few-shot fine-tuning of a pre-trained image-to-video diffusion model, using point-based warping as data augmentation, and then applies a warp-and-inpaint step at inference time that enforces cross-view consistency. The method uses LoRA for lightweight training, a masked diffusion loss to preserve the model's priors, and stochastic control guidance with DUSt3R to keep the starting images and outputs coherent across views, supporting both static view transport and dynamic camera control on real-world scenes. This reduces reliance on large, curated 4D priors and offers a practical, publicly reproducible path toward open-domain 4D video synthesis.
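
As an illustration of the masked diffusion loss mentioned above, the following minimal sketch restricts a mean-squared error to the pixels that remain visible after warping. The function name, array shapes, and the use of NumPy are assumptions made for illustration. This summary does not state the paper's exact loss, or whether it is applied in pixel or latent space.

```python
import numpy as np


def masked_diffusion_loss(pred, target, visible_mask, eps=1e-8):
    """Mean squared error restricted to visible pixels (illustrative only).

    pred, target : arrays of shape (frames, height, width, channels)
    visible_mask : array of shape (frames, height, width); 1 = visible, 0 = hole
    """
    # Add a trailing axis so the mask broadcasts over channels.
    mask = visible_mask[..., None].astype(pred.dtype)
    squared_error = (pred - target) ** 2
    # Normalise by the number of visible elements, not the full tensor size,
    # so heavily occluded views are not down-weighted.
    visible_elements = mask.sum() * pred.shape[-1]
    return float((mask * squared_error).sum() / (visible_elements + eps))
```

Because hole pixels are multiplied by zero, whatever the model predicts in the occluded regions of a warped video contributes nothing to the gradient in this sketch.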

Abstract

We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: https://hyeonho99.github.io/reangle-a-video/
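
The warping step described in the abstract can be illustrated with a generic depth-based forward warp. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `warp_to_view`, the pinhole intrinsics `K`, the per-pixel `depth` map, and the convention that `R` and `t` map source-camera coordinates to target-camera coordinates are all hypothetical choices. Pixels that receive no source point are left empty and flagged in the returned mask, which is the region an inpainting model would need to fill.

```python
import numpy as np


def warp_to_view(image, depth, K, R, t):
    """Forward-warp `image` (h, w, c) into a target camera.

    Returns (warped_image, visible_mask). Illustrative sketch only.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)]).astype(np.float64)

    # Unproject every pixel to a 3D point in the source camera frame.
    points = (np.linalg.inv(K) @ pixels) * depth.ravel()
    # Move the points into the target camera frame and project them.
    points_t = R @ points + np.asarray(t, dtype=np.float64).reshape(3, 1)
    z = points_t[2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)
    proj = K @ points_t
    u_t = np.rint(proj[0] / safe_z).astype(int)
    v_t = np.rint(proj[1] / safe_z).astype(int)
    inside = in_front & (u_t >= 0) & (u_t < w) & (v_t >= 0) & (v_t < h)

    # Z-buffer: sort nearest first, then keep the first hit per target pixel.
    src = np.flatnonzero(inside)
    src = src[np.argsort(z[src], kind="stable")]
    dst = v_t[src] * w + u_t[src]
    dst_unique, first = np.unique(dst, return_index=True)
    src_kept = src[first]

    flat_image = image.reshape(h * w, -1)
    warped = np.zeros_like(flat_image)
    visible = np.zeros(h * w, dtype=bool)
    warped[dst_unique] = flat_image[src_kept]
    visible[dst_unique] = True
    return warped.reshape(image.shape), visible.reshape(h, w)
```

Applying such a warp to every frame would give a warped video plus a visibility mask per frame. Applying it to the first frame only would give the incomplete starting image that the second stage inpaints.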

Paper Structure

This paper contains 27 sections, 7 equations, 15 figures, 4 tables, 2 algorithms.

Figures (15)

  • Figure 1: From a single monocular video of any scene, Reangle-A-Video generates synchronized videos from diverse camera viewpoints or movements without relying on any multi-view generative prior, using only a single fine-tuning of a video generator. The first row shows the input video, while the rows below present videos generated by Reangle-A-Video. (Left): Static view transport results. (Right): Dynamic camera control results. Full video examples are available on our project page: https://hyeonho99.github.io/reangle-a-video/
  • Figure 2: Qualitative results on static view transport (left) & dynamic camera control (right). Click with Acrobat Reader to play videos.
  • Figure 3: Multi-view motion learning pipelines for (a) Static view transport and (b) Dynamic camera control. For both tasks, we distill view-robust motion of the underlying scene to a pre-trained MM-DiT video model [yang2024cogvideox], using all visible pixels within the sampled videos. This few-shot, self-supervised training optimizes only the LoRA layers [hu2022lora, ryu2023low], enabling lightweight training.
  • Figure 4: Multi-view consistent image inpainting using stochastic control guidance. In experiments, we set $S=25$.
  • Figure 5: Qualitative inpainting comparisons. We compare naive inpainting to inpainting with stochastic control guidance.
  • ...and 10 more figures
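
The caption of Figure 4 gives only the setting $S=25$ for stochastic control guidance, and this summary does not spell out the procedure. One plausible reading, sketched below purely as an assumption, is best-of-S selection: draw S stochastic candidates for the next denoising state, score each for cross-view consistency, and keep the highest-scoring one. The names `select_best_candidate`, `denoise_step`, and `consistency_score` are hypothetical. In the paper the score is reported to involve DUSt3R, whereas the toy score here is just a distance to a reference vector.

```python
import numpy as np


def select_best_candidate(latent, denoise_step, consistency_score,
                          num_candidates=25, rng=None):
    """Sample `num_candidates` next states and return the best-scoring one.

    denoise_step(latent, rng) -> candidate   (hypothetical stochastic sampler)
    consistency_score(candidate) -> float    (higher means more consistent)
    """
    rng = np.random.default_rng() if rng is None else rng
    candidates = [denoise_step(latent, rng) for _ in range(num_candidates)]
    scores = np.array([consistency_score(c) for c in candidates])
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])


# Toy stand-ins so the sketch runs end to end; they are not real models.
reference = np.zeros(4)


def toy_denoise_step(latent, rng):
    # Shrink the latent and add fresh noise, mimicking a stochastic sampler step.
    return 0.9 * latent + 0.1 * rng.standard_normal(latent.shape)


def toy_consistency_score(candidate):
    # Closer to the reference counts as "more consistent" in this toy setup.
    return -float(np.linalg.norm(candidate - reference))


best, best_score = select_best_candidate(
    np.ones(4), toy_denoise_step, toy_consistency_score,
    num_candidates=25, rng=np.random.default_rng(0),
)
```

In this sketch each guided step costs S sampler calls plus S score evaluations, so larger values of S trade compute for a better chance of finding a consistent candidate.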