Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion
Ge Ya Luo, Zhi Hao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal
TL;DR
Ctrl-V addresses controllable video generation with a two-stage, bounding-box-driven pipeline for autonomous driving scenes. First, a diffusion-based BBox Generator forecasts pixel-space bounding-box trajectories from the initial and final frames. Second, these trajectories condition Box2Video, a renderer built on a frozen Stable Video Diffusion backbone with a ControlNet adapter, which produces the RGB video. The method supports 2D and 3D boxes, handles objects that appear mid-sequence, and yields higher-fidelity videos than baselines. It is evaluated on KITTI, Virtual-KITTI 2 (vKITTI), BDD100k, and nuScenes using FVD, LPIPS, SSIM, PSNR, and AP-based motion-control metrics. The result is a practical framework for generating realistic driving videos under explicit object-motion constraints, which enables synthetic data generation with precise trajectory specifications and provides evaluation tools for bounding-box-conditioned video synthesis.
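AP-style motion-control metrics generally rest on matching generated boxes to target boxes by intersection-over-union (IoU). The following is a minimal sketch of that building block only. The exact AP protocol used by Ctrl-V is not described in this section, so the function names `box_iou` and `is_match` and the 0.5 threshold are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes give zero intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, target: Box, threshold: float = 0.5) -> bool:
    """Hypothetical matching rule: a box counts as a hit if IoU >= threshold."""
    return box_iou(pred, target) >= threshold
```

A full AP computation would additionally rank detections by confidence and integrate precision over recall; the sketch covers only the per-box overlap test.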
Abstract
Controllable video generation has attracted significant attention, largely due to advances in video diffusion models. In domains such as autonomous driving, it is essential to predict object motion with high accuracy. This paper addresses the challenge of exerting precise control over object motion in realistic video synthesis. To accomplish this, we 1) control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space, 2) employ a distinct, specialized model to forecast the trajectories of object bounding boxes based on their previous and, if desired, future positions, and 3) adapt and enhance a separate video diffusion network to create video content based on these high-quality trajectory forecasts. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation. Extensive experiments conducted on the KITTI, Virtual-KITTI 2, BDD100k, and nuScenes datasets validate the effectiveness of our approach in generating realistic and controllable videos.
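To make the idea of box renderings in pixel space concrete, the sketch below draws 2D box outlines into per-frame binary masks, which is one plausible form for the conditioning frames passed to a video model. It is an illustration under stated assumptions: the linear interpolation in `interpolate_boxes` is a deliberately naive stand-in for the learned trajectory forecaster (Ctrl-V uses a diffusion model for that step), and the function names, the integer `(x_min, y_min, x_max, y_max)` box format, and the single-channel mask are all hypothetical choices.

```python
from typing import List, Tuple

# Assumed box format: integer pixel corners (x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int]


def interpolate_boxes(first: Box, last: Box, num_frames: int) -> List[Box]:
    """Naive stand-in for a trajectory forecaster: linear interpolation
    between the box in the first frame and the box in the last frame."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    boxes = []
    for t in range(num_frames):
        w = t / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        boxes.append(tuple(round((1 - w) * f + w * l) for f, l in zip(first, last)))
    return boxes


def render_box_frame(boxes: List[Box], height: int, width: int) -> List[List[int]]:
    """Draw one-pixel-wide box outlines into a height x width binary mask."""
    frame = [[0] * width for _ in range(height)]

    def put(x: int, y: int) -> None:
        # Skip pixels that fall outside the image instead of clamping them.
        if 0 <= x < width and 0 <= y < height:
            frame[y][x] = 1

    for x0, y0, x1, y1 in boxes:
        for x in range(x0, x1 + 1):  # top and bottom edges
            put(x, y0)
            put(x, y1)
        for y in range(y0, y1 + 1):  # left and right edges
            put(x0, y)
            put(x1, y)
    return frame


# Usage: one object moving diagonally across three 8x8 frames.
trajectory = interpolate_boxes((0, 0, 2, 2), (4, 4, 6, 6), num_frames=3)
frames = [render_box_frame([box], height=8, width=8) for box in trajectory]
```

Passing an empty box list for early frames and a non-empty one later would represent an object that appears mid-sequence. Rendering a 3D box would instead draw the projected edges of its eight corners, which this sketch does not cover.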
