BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein

TL;DR

To address the coupling of world time and camera motion in contemporary video diffusion models, the paper introduces a 4D-controllable framework that decouples temporal evolution from viewpoint. It conditions on continuous world-time and camera trajectories via Time-RoPE, Time-AdaLN, 4D-RoPE, and Camera-AdaLN, and trains on a synthetic dataset with independently varying time and camera factors. The authors provide extensive ablations showing the conditioning design outperforms baselines, and demonstrate robust 4D control on synthetic and real videos, achieving state-of-the-art controllability with competitive visual quality. They also plan to release the 4D-controlled dataset and showcase practical applications such as 4D video editing and bullet-time effects.

Abstract

Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/
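The two injection pathways named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the frequency base of 10000, and the `(1 + scale)` modulation convention are assumptions borrowed from common rotary-embedding and DiT-style AdaLN practice, and the scale and shift values are passed in directly rather than predicted by an MLP. The sketch shows how a rotary encoding can take a continuous world-time value instead of an integer frame index, and how a normalized feature vector can be modulated by an affine scale and shift.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles position * freq_i.

    `position` may be any real number, e.g. a continuous world-time value.
    The frequency schedule is the usual rotary-embedding one (an assumption).
    """
    dim = len(vec)
    assert dim % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


def adaln_modulate(features, scale, shift, eps=1e-6):
    """Layer-normalize `features`, then apply (1 + scale) * x + shift.

    In a conditioned model, `scale` and `shift` would come from a small
    network fed with the time or camera signal; here they are plain lists.
    """
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    normed = [(f - mean) / math.sqrt(var + eps) for f in features]
    return [(1.0 + sc) * n + sh for n, sc, sh in zip(normed, scale, shift)]


# Hypothetical query/key vectors and world-time stamps.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
logit_a = dot(rope_rotate(q, 0.25), rope_rotate(k, 1.0))
logit_b = dot(rope_rotate(q, 2.25), rope_rotate(k, 3.0))
# Both pairs are 0.75 time units apart, so the two logits agree.
modulated = adaln_modulate([1.0, 2.0, 3.0, 4.0], [0.1] * 4, [0.5] * 4)
```

Because each feature pair is rotated by an angle proportional to its time stamp, the attention logit depends only on the difference between the two time stamps. That property is what makes a rotary encoding a natural fit for non-uniform or fractional time values.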

Paper Structure

This paper contains 34 sections, 6 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Time- and camera-controlled 4D video generation. Given a single input video where camera motion is entangled with uniform temporal sampling (top row), our method synthesizes new videos that enable decoupled control over world time and camera pose.
  • Figure 2: Method Overview. Given a conditional input video, our diffusion model generates new videos under 4D control using world time and camera trajectory. These two signals are injected into the Diffusion Transformer through complementary modulation pathways. Time control is enabled by $\mathrm{RoPE}_t$ (a time-aware positional encoding injected into attention) and $\mathrm{MLP}_t$, which predicts the affine scale and shift used to modulate intermediate features. Camera control is introduced analogously through $\mathrm{RoPE}_c$ (a camera-aware positional encoding) and $\mathrm{MLP}_c$. The outputs of $\mathrm{RoPE}_t$ and $\mathrm{RoPE}_c$ are fused into a unified 4D positional encoding injected into the attention layers. Together, these mechanisms form a 4D-controllable DiT block capable of jointly steering temporal evolution and camera motion during generation. We train our model on a curated 4D-controlled synthetic dataset that we constructed, where temporal and camera factors vary independently across scenes, providing explicit supervision for disentangling time and camera control.
  • Figure 3: Comparison on Synthetic Videos. GT frames compared with predictions from our method and state-of-the-art novel-view synthesis models. Our method adheres most closely to the target camera conditions and produces the finest level of detail.
  • Figure 4: Qualitative Comparison of Camera- and Time-Controlled Video Generation on Real-World Videos. Qualitative comparison between our method and state-of-the-art novel-view synthesis models extended with time remapping (RIFE; Huang et al., 2022). In the left example, existing methods struggle under extreme view and time changes, producing severe artifacts (ReCamMaster) and showing imprecise camera control (TrajectoryCrafter). The right example similarly illustrates strong artifacts and reduced detail from ReCamMaster, while TrajectoryCrafter again fails to follow the prescribed trajectory.
  • Figure 5: 4D Control: Camera and Time Manipulation. Our model generates videos that faithfully follow independently specified camera and time controls. Each row shows combinations of fixed or moving camera viewpoints and fixed or changing world time. The model correctly applies each control mode, including challenging settings such as moving camera with fixed time (bullet-time effect), while preserving scene dynamics and visual coherence. These results indicate strong disentanglement between camera and world-time conditioning as well as robust generalization across diverse real-world inputs.
  • ...and 6 more figures
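
The control combinations described for Figure 5 (fixed or moving camera, fixed or changing world time) can be written down as two independent per-frame sequences. The sketch below only illustrates that idea. The function names, the mode strings, the normalized time range, and the reduction of a camera trajectory to a single yaw angle are all hypothetical simplifications, not the paper's conditioning format.

```python
def world_time_sequence(num_frames, mode, start=0.0, end=1.0, freeze_at=0.5):
    """Return one world-time value per output frame.

    mode = "fixed"   : time is frozen at `freeze_at` for every frame
    mode = "uniform" : time advances linearly from `start` to `end`
    mode = "reverse" : time runs linearly from `end` back to `start`
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if mode == "fixed":
        return [freeze_at] * num_frames
    if mode not in ("uniform", "reverse"):
        raise ValueError(f"unknown mode: {mode}")
    if num_frames == 1:
        return [start]
    step = (end - start) / (num_frames - 1)
    forward = [start + i * step for i in range(num_frames)]
    return forward if mode == "uniform" else forward[::-1]


def yaw_sequence(num_frames, moving, max_yaw_deg=30.0):
    """Return one camera yaw angle (degrees) per frame.

    A real trajectory would carry full camera poses; yaw alone is a
    stand-in to keep the example short.
    """
    if not moving or num_frames == 1:
        return [0.0] * num_frames
    return [max_yaw_deg * i / (num_frames - 1) for i in range(num_frames)]


# The four combinations of camera and time control for a 5-frame clip.
controls = {
    "fixed camera, fixed time": (yaw_sequence(5, False), world_time_sequence(5, "fixed")),
    "fixed camera, changing time": (yaw_sequence(5, False), world_time_sequence(5, "uniform")),
    "moving camera, fixed time": (yaw_sequence(5, True), world_time_sequence(5, "fixed")),
    "moving camera, changing time": (yaw_sequence(5, True), world_time_sequence(5, "uniform")),
}
```

The "moving camera, fixed time" entry corresponds to the bullet-time setting: the camera sequence varies while every frame shares the same world-time value.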