ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada

TL;DR

ConfCtrl is a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions, reconstructing occluded regions under large viewpoint changes.

Abstract

We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
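
The confidence-weighted initialization described above can be illustrated with a small NumPy sketch. The abstract does not give the blending rule, so the linear mix below, the function name `confidence_weighted_init`, and the toy tensor shapes are all assumptions made for illustration, not the paper's formulation.

```python
import numpy as np

def confidence_weighted_init(projected_latent, confidence, noise):
    """Blend a projected point cloud latent with Gaussian noise.

    Hypothetical rule (an assumption, not the paper's equation):
        z0 = c * projected_latent + (1 - c) * noise
    where c in [0, 1] is a per-location confidence map.
    """
    c = np.clip(confidence, 0.0, 1.0)
    return c * projected_latent + (1.0 - c) * noise

rng = np.random.default_rng(0)
projected = np.ones((4, 8, 8))        # toy latent: channels x height x width
noise = rng.standard_normal(projected.shape)
confidence = np.zeros((1, 8, 8))      # broadcast across channels
confidence[:, :, :4] = 1.0            # left half: projection fully trusted

z0 = confidence_weighted_init(projected, confidence, noise)
```

In this toy setup the trusted left half of `z0` reproduces the projected latent exactly, while the untrusted right half is pure noise, leaving those regions for the diffusion model to complete.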

Paper Structure (13 sections, 12 equations, 5 figures, 4 tables)

This paper contains 13 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of ConfCtrl. (A) demonstrates the strong prior of a first- and end-frame video interpolation model, which our model fully inherits as the basis for further geometric improvement. (B) shows that ConfCtrl enables precise camera control in video diffusion models. (C) shows that camera-guided video diffusion models struggle to strictly follow the target camera poses. (D) indicates that feedforward models underperform due to limited generative capacity, leading to inaccurate estimation of Gaussian Splat parameters.
  • Figure 2: The ConfCtrl framework. (A) Model Architecture. Our framework builds upon the Wan2.1-Interpolation model [wan]. We introduce two key designs to enhance precise camera-following capability: (1) noise initialization using a confidence-weighted projected point cloud latent, and (2) a predict-update camera conditioning mechanism that balances the predicted state against the projected point cloud, which is treated as a noisy measurement. (B) Kalman DiT Block. It consists of two submodules: predict and update.
  • Figure 3: Qualitative comparison with baseline methods on the CO3D-Hydrant, CO3D-Teddybear, and DL3DV datasets. The two input images exhibit a large viewpoint change, demonstrating the robustness of our method under challenging view variations.
  • Figure 4: Zero-shot results. Qualitative comparison between our method and the ground truth on out-of-distribution datasets, demonstrating strong generalization ability.
  • Figure 5: Qualitative Ablation. (a) Visual comparison of different model configurations showing the effect of each component. (b) Ablation across 3D foundation models. Our results and the ground truth (GT) are boxed in matching colors in (a).
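
The predict and update submodules named in Figure 2 (B) echo the classical Kalman filter. For intuition only, the sketch below implements the textbook elementwise Kalman measurement update with an identity observation model. According to the abstract, ConfCtrl instead uses learned residual corrections, so this closed-form gain is a stand-in, and the names `kalman_update`, `p_pred`, and `r` are illustrative assumptions.

```python
import numpy as np

def kalman_update(x_pred, p_pred, z, r):
    """Classical elementwise Kalman measurement update (identity observation).

    x_pred: predicted state        p_pred: predicted state variance
    z:      noisy measurement      r:      measurement noise variance
    """
    k = p_pred / (p_pred + r)            # gain: how much to trust the measurement
    x_new = x_pred + k * (z - x_pred)    # correct the prediction by the weighted residual
    p_new = (1.0 - k) * p_pred           # uncertainty shrinks after the update
    return x_new, p_new, k

x_pred = np.array([0.0, 0.0])            # e.g. pose-driven prediction
p_pred = np.array([1.0, 1.0])
z = np.array([1.0, 1.0])                 # e.g. projected point cloud observation
r = np.array([0.01, 100.0])              # reliable vs. unreliable measurement

x_new, p_new, k = kalman_update(x_pred, p_pred, z, r)
```

With a small measurement variance the gain approaches 1 and the state moves almost entirely to the measurement. With a large variance the gain approaches 0 and the prediction is kept, which mirrors the idea of relying on reliable projections while down-weighting uncertain regions.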