CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control

Stefan Popov, Amit Raj, Michael Krainin, Yuanzhen Li, William T. Freeman, Michael Rubinstein

TL;DR

CamCtrl3D addresses the challenge of generating immersive fly-through videos from a single image and a 3D camera path by extending a pretrained latent video diffusion model with four conditioning streams. A ControlNet-style fusion combines raw extrinsics, camera rays, the re-projected initial image, and a global 3D representation built with 2D↔3D transformers to achieve geometry-aware view synthesis. The authors introduce a metric that balances overall video quality against detail preservation, calibrate datasets to metric scales, and demonstrate state-of-the-art results on RealEstate10K and DL3DV with a final model trained on 10K posed videos. This yields high-fidelity, 3D-consistent fly-throughs with relatively modest data requirements, advancing practical single-image scene exploration.

Abstract

We propose a method for generating fly-through videos of a scene from a single image and a given camera trajectory. We build upon an image-to-video latent diffusion model. We condition its UNet denoiser on the camera trajectory using four techniques. (1) We condition the UNet's temporal blocks on raw camera extrinsics, similar to MotionCtrl. (2) We use images containing camera rays and directions, similar to CameraCtrl. (3) We reproject the initial image to subsequent frames and use the resulting video as a condition. (4) We use 2D↔3D transformers to introduce a global 3D representation, which implicitly conditions on the camera poses. We combine all conditions in a ControlNet-style architecture. We then propose a metric that evaluates overall video quality and the ability to preserve details with view changes, which we use to analyze the trade-offs of individual and combined conditions. Finally, we identify an optimal combination of conditions. We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-the-art results.
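Condition (2) above can be illustrated with a minimal sketch of a per-pixel camera ray image. The abstract does not specify the encoding, so the choices here are assumptions: a pinhole camera, a 6-channel layout of ray origin followed by unit direction, and the hypothetical helper names `pixel_ray` and `ray_image`.

```python
import math

def pixel_ray(u, v, intrinsics, rot_c2w, origin_w):
    """Return (ox, oy, oz, dx, dy, dz) for pixel (u, v) in world coordinates.

    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    rot_c2w:    3x3 camera-to-world rotation (nested lists).
    origin_w:   camera center in world coordinates.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel center onto the z = 1 plane of the camera frame.
    d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
    # Rotate the direction into the world frame.
    d_world = [sum(rot_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    # Normalize so every pixel stores a unit direction.
    norm = math.sqrt(sum(c * c for c in d_world))
    return tuple(origin_w) + tuple(c / norm for c in d_world)

def ray_image(width, height, intrinsics, rot_c2w, origin_w):
    """Build a height x width grid of 6-channel ray entries for one frame."""
    return [[pixel_ray(u, v, intrinsics, rot_c2w, origin_w)
             for u in range(width)]
            for v in range(height)]

# Usage: a 3x3 image seen by a camera at the world origin looking down +z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rays = ray_image(3, 3, (2.0, 2.0, 1.5, 1.5), identity, (0.0, 0.0, 0.0))
```

Building one such grid per frame of the trajectory gives a video-shaped tensor that convolutional layers can consume, which is the general idea behind ray-image conditioning; the actual channel layout used by CamCtrl3D may differ.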

Paper Structure (20 sections, 7 figures, 2 tables)

This paper contains 20 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Our method CamCtrl3D generates videos of scene fly-throughs, given an initial image for frame #0 and a 3D camera trajectory (bottom row). The generated videos are high-quality and closely match the ground truth (top row).
  • Figure 2: Top left: We add camera conditioning to the UNet denoiser of SVD \cite{blattmann23svd} by modifying its layers. Top right: We attach the camera extrinsics and the 2D↔3D transformer conditions to UNet's temporal layers (Sections \ref{sec:cond-raw-extr} and \ref{sec:cond-raytran}). Bottom: We add additional top-level convolutional layers for the camera ray and re-projected image conditions (Sections \ref{sec:cond-rayod} and \ref{sec:cond-reproj}).
  • Figure 3: We re-project the surfaces observed on the initial image to all subsequent frames, using ZoeDepth \cite{bhat23zoedepth} to estimate a point cloud. We use the resulting frames as a condition (Section \ref{sec:cond-reproj}) and during evaluation (Section \ref{sec:results-eval-metric}).
  • Figure 4: We apply conditions to a clone of the UNet encoder (Section \ref{sec:cnet-cond}), and we add its outgoing residual connections to those of the original encoder, after passing through zero convolutions \cite{zhang2023controlnet}.
  • Figure 5: Re-projection (Section \ref{sec:cond-reproj}) identifies regions within a frame originating from the initial image (e.g. frame #12 here). We apply the resulting mask to both ground truth and generated images and measure image difference (Section \ref{sec:results-eval-metric}) to assess the model's ability to maintain visual consistency during camera change.
  • ...and 2 more figures
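
The re-projection and masked comparison described in the captions for Figures 3 and 5 can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: depth is passed in directly (the paper estimates it with ZoeDepth), the camera is a pinhole with shared intrinsics, occlusion handling is omitted, and mean absolute difference stands in for whatever image difference measure the paper uses. The names `reproject_mask` and `masked_mean_abs_diff` are hypothetical.

```python
import math

def reproject_mask(depth, intrinsics, rot_rel, trans_rel):
    """Forward-warp frame-0 pixels into a target view and mark where they land.

    depth:      H x W depths along the camera z axis for frame 0.
    intrinsics: (fx, fy, cx, cy), assumed shared by both views.
    rot_rel, trans_rel: map frame-0 camera coordinates to target coordinates.
    No z-buffering is done, so occlusions are ignored in this sketch.
    """
    fx, fy, cx, cy = intrinsics
    h, w = len(depth), len(depth[0])
    mask = [[False] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            # Unproject the pixel center to a 3D point in frame-0 coordinates.
            p = ((u + 0.5 - cx) / fx * z, (v + 0.5 - cy) / fy * z, z)
            # Move the point into the target camera frame.
            q = [sum(rot_rel[i][j] * p[j] for j in range(3)) + trans_rel[i]
                 for i in range(3)]
            if q[2] <= 0:
                continue  # behind the target camera
            # Project back to pixel coordinates in the target view.
            u2 = math.floor(fx * q[0] / q[2] + cx)
            v2 = math.floor(fy * q[1] / q[2] + cy)
            if 0 <= u2 < w and 0 <= v2 < h:
                mask[v2][u2] = True
    return mask

def masked_mean_abs_diff(image_a, image_b, mask):
    """Mean absolute difference over pixels where mask is True."""
    total, count = 0.0, 0
    for row_a, row_b, row_m in zip(image_a, image_b, mask):
        for a, b, m in zip(row_a, row_b, row_m):
            if m:
                total += abs(a - b)
                count += 1
    return total / count if count else 0.0

# Usage: a flat wall at depth 1, with points shifted by +1 along x
# in the target camera frame.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth = [[1.0] * 4 for _ in range(4)]
mask = reproject_mask(depth, (2.0, 2.0, 2.0, 2.0), identity, (1.0, 0.0, 0.0))
ground_truth = [[0.0] * 4 for _ in range(4)]
generated = [[9.0, 9.0, 1.0, 3.0] for _ in range(4)]
error = masked_mean_abs_diff(ground_truth, generated, mask)
```

Restricting the comparison to the mask means only content that was visible in the initial image is scored, so newly revealed regions, where many plausible completions exist, do not penalize the model.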