FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis

Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho

TL;DR

FloVD is an optical-flow-based framework for camera-controllable video synthesis that decouples flow generation from video rendering. By representing camera and object motions as two separate flow streams and integrating them into a flow-conditioned diffusion model, FloVD can train on arbitrary videos without ground-truth camera parameters while achieving detailed, 3D-aware camera control. The model comprises two diffusion components, OMSM for object motion and FVSM for flow-conditioned video synthesis, trained on large internal datasets and refined with a curated subset lacking camera motion. Quantitative and qualitative evaluations show better camera controllability and more natural object motion than prior methods, with practical applications in temporally consistent video editing and cinematic camera control.

Abstract

We present FloVD, a novel video diffusion model for camera-controllable video generation. FloVD leverages optical flow to represent the motions of the camera and moving objects. This approach offers two key benefits. Since optical flow can be directly estimated from videos, our approach allows for the use of arbitrary training videos without ground-truth camera parameters. Moreover, as background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
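To make concrete the claim that background optical flow encodes the camera motion, here is a minimal sketch of how a camera move induces a flow field over a static scene. It is a standard pinhole-camera reprojection, not necessarily the procedure used in FloVD: the per-pixel depth map `depth`, the intrinsics `K`, the relative pose `(R, t)`, and the function name `camera_flow` are all assumptions introduced for illustration.

```python
import numpy as np

def camera_flow(depth, K, R, t):
    """Optical flow induced purely by camera motion over a static scene.

    Hypothetical helper (not from the paper).
    depth: (H, W) depth of each pixel in the first view.
    K:     (3, 3) pinhole intrinsics, assumed shared by both views.
    R, t:  rotation (3, 3) and translation (3,) that map points from
           first-camera coordinates to second-camera coordinates.
    Returns an (H, W, 2) array of (dx, dy) displacements in pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels

    # Back-project each pixel to a 3D point using its depth.
    rays = pix @ np.linalg.inv(K).T
    pts = rays * depth[..., None]

    # Move the points into the second camera and project them again.
    pts2 = pts @ R.T + t
    proj = pts2 @ K.T
    uv2 = proj[..., :2] / proj[..., 2:3]

    # Flow is the displacement of each pixel between the two views.
    return uv2 - pix[..., :2]


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 8.0],
                  [0.0, 100.0, 6.0],
                  [0.0, 0.0, 1.0]])
    depth = np.full((12, 16), 2.0)
    flow = camera_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
    print(flow.shape, flow[0, 0])
```

Because the displacement depends on depth, nearer points move more than distant ones under the same translation, which is why a flow field of this kind carries 3D information about the viewpoint change. How FloVD obtains depth and combines this camera flow with object flow is not specified in this excerpt.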

Paper Structure

This paper contains 32 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: (Left) Our method using optical flow enables video synthesis with complex camera movements (dolly zoom). (Right) Synthesized video frames with 'zoom-out' camera motion. The X-t slice shows how pixel values along the red line change over time. Our method shows natural object motion and accurate camera control, while CameraCtrl (He et al., 2024) produces an object without motion, and MotionCtrl (Wang et al., 2024) produces artifacts.
  • Figure 2: Overview of FloVD. Given an image and camera parameters, our framework synthesizes video frames following the input camera trajectory. To this end, we synthesize two sets of optical flow maps that represent camera and object motions. The two sets of flow maps are then integrated and fed into the flow-conditioned video synthesis model, enabling camera-controllable video generation.
  • Figure 3: Network architectures of OMSM and FVSM.
  • Figure 4: Object flow maps synthesized by OMSM when trained on the full dataset (left) and on the curated dataset (right). White indicates optical flow vectors with no motion.
  • Figure 5: Qualitative comparison of camera control using the RealEstate10K test dataset (Zhou et al., 2018). MotionCtrl (Wang et al., 2024) often fails to follow the input camera parameters. Notably, our method shows accurate camera control results despite not using camera parameters in training.
  • ...and 4 more figures
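
The X-t slice used in Figure 1 can be sketched in a few lines: one fixed scanline is taken from every frame and the scanlines are stacked over time. The array layout `(T, H, W, C)` and the function name `xt_slice` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def xt_slice(video, row):
    """Stack one scanline from every frame into a (T, W, C) image.

    Hypothetical helper. video is assumed to have shape (T, H, W, C);
    row is the index of the scanline (the "red line" in Figure 1).
    """
    return video[:, row, :, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((6, 8, 3))
    static_video = np.repeat(frame[None], 4, axis=0)  # 4 identical frames
    print(xt_slice(static_video, row=2).shape)
```

In such a slice a motionless scene appears as straight vertical streaks, since every time step repeats the same scanline, while camera or object motion bends or shifts the streaks.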