Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

Ge Ya Luo, Zhi Hao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal

TL;DR

Ctrl-V tackles controllable video generation with a two-stage, bounding-box-driven pipeline for autonomous driving scenes. A diffusion-based BBox Generator forecasts pixel-space bounding-box trajectories from an initial frame together with the initial and final bounding-box frames; these trajectories then condition Box2Video, a renderer built on a frozen Stable Video Diffusion backbone with a ControlNet adapter, to produce RGB videos. The method supports 2D and 3D boxes, handles objects appearing mid-sequence, and yields higher-fidelity videos than baselines, as validated on KITTI, Virtual-KITTI 2 (vKITTI), BDD100k, and nuScenes using FVD, LPIPS, SSIM, PSNR, and an AP-based motion-control metric. The result is a practical framework for generating realistic driving videos under explicit object-motion constraints, which enables synthetic data generation with precise trajectory specifications and provides improved evaluation tools for bounding-box-conditioned video synthesis.

Abstract

Controllable video generation has attracted significant attention, largely due to advances in video diffusion models. In domains such as autonomous driving, it is essential to develop highly accurate predictions for object motions. This paper tackles a crucial challenge of how to exert precise control over object motion for realistic video synthesis. To accomplish this, we 1) control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space, 2) employ a distinct, specialized model to forecast the trajectories of object bounding boxes based on their previous and, if desired, future positions, and 3) adapt and enhance a separate video diffusion network to create video content based on these high quality trajectory forecasts. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation. Extensive experiments conducted on the KITTI, Virtual-KITTI 2, BDD100k, and nuScenes datasets validate the effectiveness of our approach in producing realistic and controllable video generation.
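
To make the idea of "renderings of boxes in pixel space" concrete, here is a minimal sketch of how per-object 2D boxes could be rasterized into bounding-box frames. It is an illustration under stated assumptions, not the authors' implementation: the function names `render_box_frame` and `interpolate_boxes`, the `(x0, y0, x1, y1)` box format, and the per-track color mapping are all hypothetical. The linear interpolation is only a naive stand-in for the learned, diffusion-based trajectory model described above.

```python
import numpy as np


def render_box_frame(boxes, height, width, colors):
    """Rasterize filled 2D boxes into an RGB frame.

    boxes:  dict mapping track id -> (x0, y0, x1, y1) in pixels (assumed format)
    colors: dict mapping track id -> (r, g, b), one color per object track
    """
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    for track_id, (x0, y0, x1, y1) in boxes.items():
        # Clip to the image bounds so partially visible objects still render.
        x0c, x1c = (int(v) for v in np.clip([x0, x1], 0, width))
        y0c, y1c = (int(v) for v in np.clip([y0, y1], 0, height))
        frame[y0c:y1c, x0c:x1c] = colors[track_id]
    return frame


def interpolate_boxes(first, last, num_frames):
    """Linearly interpolate boxes between the first and last frame.

    This is NOT the BBox Generator; it is a placeholder that produces one
    plausible trajectory for objects present in both endpoint frames.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    trajectory = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)
        frame_boxes = {}
        for track_id in first.keys() & last.keys():
            a = np.asarray(first[track_id], dtype=float)
            b = np.asarray(last[track_id], dtype=float)
            frame_boxes[track_id] = tuple((1 - alpha) * a + alpha * b)
        trajectory.append(frame_boxes)
    return trajectory


# Example: one object moving 40 pixels to the right over 5 frames.
first = {0: (10, 10, 30, 30)}
last = {0: (50, 10, 70, 30)}
colors = {0: (0, 255, 0)}
trajectory = interpolate_boxes(first, last, num_frames=5)
box_frames = np.stack(
    [render_box_frame(b, height=64, width=128, colors=colors) for b in trajectory]
)
```

In this sketch, `box_frames` has shape `(5, 64, 128, 3)` and plays the role of the bounding-box frame sequence that a box-conditioned video model would consume. Objects that appear mid-sequence are not handled by the interpolation, which is one reason a learned trajectory model is useful.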

Paper Structure (42 sections, 3 equations, 26 figures, 8 tables)


Figures (26)

  • Figure 1: Overview of Ctrl-V's generation pipeline: (Left) inputs: Our inputs include an initial frame, its corresponding bounding box image and the final frame's bounding box image. (Middle) generated bounding box trajectories: We demonstrate three distinct possible trajectory sequences produced by our diffusion-based bounding box motion generation model -- BBox Generator. (Right) generated video clips: Our Box2Video model conditions on the generated bounding box trajectory frames to produce the final video clips.
  • Figure 2: The diagram illustrates two components of Ctrl-V: (left) the BBox Generator and (right) Box2Video. For both models, we use a frozen, off-the-shelf VAE to encode images into latent space ($\mathcal{E}$) and decode them back into pixel space ($\mathcal{D}$). During training, (1) the BBox Generator (Sec. \ref{sec:bbox_predictor}) learns to denoise the noisy bounding box frame latents $\hat{\bm{b}}_t$, conditioned on the first ($\bm{b}^{(0)}$) and last ($\bm{b}^{(N-1)}$) bounding box frame latents and the padded initial frame latent $\bm{z}_{pad}^{(0)}$, and (2) the Box2Video (Sec. \ref{sec:controlnet}) denoises the target frame latents $\hat{\bm{z}}_t$ by conditioning on the initial frame's latent $\bm{z}_{pad}^{(0)}$ (input to the SVD UNet) and the bounding box frame latents $\bm{b}$ (input to the ControlNet).
  • Figure 3: The first two rows illustrate video samples generated using the Ctrl-V pipeline, with one initial frame, three initial bounding box frames, and one final bounding box frame as input. The first row shows bounding box trajectories from the BBox-generator in pixel space (solid rectangles for predictions, wireframe rectangles for ground truth). The second row presents frames generated by the Box2Video model, conditioned on the BBox-generator's output. The third row displays ground-truth frames, while the fourth row shows frames generated by the Stable Video Diffusion (SVD) baseline. In the Ctrl-V video, the car with the bright-green bounding box, which initially pokes out its nose in the lane to the left of the ego car, stays beside the ego car in the final frame. Meanwhile, the silver car with the olive bounding box, which starts in the lane to the right of the ego car, speeds off and is replaced by a new car (purple bounding box) entering the frame. These generated frames closely match the car positions seen in the conditioned inputs. In contrast, the SVD-generated video shows the black car on the left accelerating and moving ahead of the ego car, while the silver car remains in the same relative position to the ego car throughout.
  • Figure 4: This figure visualizes two samples of bounding box trajectories generated by the BBox Generator, conditioned on the same set of three initial bounding box frames and one final bounding box frame (solid rectangles represent predictions, and wireframe rectangles represent ground truth). Although the intermediate frames show notable differences, the initial and final frames align closely with the ground-truth bounding boxes.
  • Figure 5: Illustrations of the Box2Video generations conditioned on ground truth 3D bounding box trajectories (2D for BDD) across various datasets. The 2D outlines of the ground-truth bounding boxes are overlaid on top.
  • ...and 21 more figures
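
The captions for Figures 3 and 4 compare predicted boxes against ground truth, and the TL;DR mentions an AP-based motion-control metric. Average-precision metrics for boxes are conventionally built on intersection-over-union (IoU), so a minimal IoU sketch is given here for orientation. The function name `box_iou` and the `(x0, y0, x1, y1)` box format are assumptions made for illustration; this section does not specify the exact evaluation protocol the paper uses.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap region, zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# A predicted box shifted by half its width against the ground truth.
iou_shifted = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A predicted box is typically counted as a match when its IoU with a ground-truth box exceeds a chosen threshold; the threshold is a free parameter of the metric rather than something stated in this section.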