FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
TL;DR
FloVD introduces an optical-flow-based framework for camera-controllable video synthesis that decouples flow generation from video rendering. By representing camera and object motions as two separate flow streams and integrating them into a flow-conditioned diffusion model, FloVD can train on arbitrary videos without ground-truth camera parameters while achieving detailed, 3D-aware camera control. The model comprises two diffusion components, OMSM for object motion and FVSM for flow-conditioned video synthesis, trained on large internal datasets and refined with a curated subset lacking camera motion. Extensive quantitative and qualitative evaluations demonstrate superior camera controllability and natural object motion compared to prior methods, with practical applications in temporally consistent video editing and cinematic camera control.
Abstract
We present FloVD, a novel video diffusion model for camera-controllable video generation. FloVD leverages optical flow to represent the motions of the camera and moving objects. This approach offers two key benefits. Since optical flow can be directly estimated from videos, our approach allows for the use of arbitrary training videos without ground-truth camera parameters. Moreover, as background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
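To make the idea of background optical flow encoding 3D correlation across viewpoints more concrete, the following is a minimal sketch of how camera motion alone could be turned into a flow map for a static scene. It is an illustration and not the authors' implementation: the abstract does not say how camera flow is computed, so the per-pixel depth map, the pinhole intrinsics `K`, the relative pose `(R, t)`, and the function name `camera_flow` are all assumptions introduced here.

```python
import numpy as np


def camera_flow(depth, K, R, t):
    """Optical flow induced by camera motion alone, for a static scene.

    Hypothetical helper (not from the paper). Assumes:
      depth : (H, W) depth of each pixel in the first camera frame
      K     : (3, 3) pinhole intrinsics shared by both views
      R, t  : rotation (3, 3) and translation (3,) that map points from
              first-camera coordinates to second-camera coordinates
    Returns an (H, W, 2) array of per-pixel displacements (dx, dy).
    """
    H, W = depth.shape
    xs, ys = np.meshgrid(
        np.arange(W, dtype=np.float64), np.arange(H, dtype=np.float64)
    )
    # Homogeneous pixel coordinates, shape (H, W, 3).
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)

    # Unproject each pixel to a 3D point using its depth.
    rays = pix @ np.linalg.inv(K).T
    pts = rays * depth[..., None]

    # Move the points into the second camera frame and reproject.
    pts2 = pts @ R.T + t
    proj = pts2 @ K.T
    uv2 = proj[..., :2] / proj[..., 2:3]

    # Flow is the displacement of each pixel between the two views.
    return uv2 - pix[..., :2]


# Usage: a fronto-parallel plane at depth 2 and a small sideways shift.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 2.0)
flow = camera_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

Under these assumptions, a pure sideways translation produces a horizontal displacement of `fx * tx / depth` pixels, so nearer pixels move more than distant ones. This parallax is one way a flow map can carry 3D structure across viewpoints.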
