Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos

Shuo Sun, Unal Artan, Malcolm Mielle, Achim J. Lilienthal, and Martin Magnusson

Abstract

We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.

Paper Structure (29 sections, 15 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Given video sequences captured by multiple freely moving cameras, our method recovers consistent dense dynamic scenes and estimates camera poses accurately. We illustrate our full pipeline (top) alongside reconstruction results on four additional sequences (bottom).
  • Figure 2: Method overview. Given multiple video inputs, our method first uses a feed-forward model for initialization, obtaining a global scale anchor and initial poses (Step 1). Then, we build a spatio-temporal connection graph during tracking to estimate camera poses and maintain a consistent scale (Step 2). Finally, we leverage the dense optical flow, the estimated poses, and the resulting connection graph to refine per-pixel depth, yielding a consistent scene and refined camera poses.
  • Figure 3: Illustration of the spatio-temporal graph. First, each camera establishes temporal connections among its own frames. Second, at timestamp $t_0$, Cam.1 attempts a spatial connection with Cam.0 if there is enough overlap. Additionally, the current active keyframe attempts spatio-temporal connections with inactive frames from other cameras if there is enough overlap. Ablation studies (\ref{subsec:ablation_study}) show that spatio-temporal connections improve tracking accuracy.
  • Figure 4: Visualization (projected onto the X-Y plane) of camera trajectories estimated by different methods on the two datasets. The trajectories of all cameras are treated as one joint trajectory and aligned to the ground-truth (GT) trajectories with a Sim(3) alignment.
  • Figure 5: Qualitative reconstruction results on the MultiCamRobolab dataset.
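
The joint Sim(3) alignment mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes the standard closed-form Umeyama least-squares solution, which is a common choice for this step but is not confirmed by the text above as the authors' exact procedure. The function names `umeyama_sim3` and `align_joint`, the synthetic trajectories, and the RMSE error measure are all hypothetical and chosen only for illustration.

```python
import numpy as np


def umeyama_sim3(est, gt):
    """Closed-form least-squares similarity (s, R, t) such that gt ≈ s * R @ est + t."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)                  # 3x3 cross-covariance (target x source)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # avoid returning a reflection
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / len(est)         # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t


def align_joint(est_trajs, gt_trajs):
    """Stack all cameras' positions into one trajectory and align it once."""
    est = np.concatenate(est_trajs)
    gt = np.concatenate(gt_trajs)
    s, R, t = umeyama_sim3(est, gt)
    aligned = s * est @ R.T + t
    rmse = np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())
    return aligned, rmse, (s, R, t)


# Synthetic check (assumed data): two cameras, estimates differ from GT
# by one shared similarity transform.
rng = np.random.default_rng(0)
gt_trajs = [rng.normal(size=(50, 3)), rng.normal(size=(40, 3))]
theta = 0.7
R0 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
s0, t0 = 2.5, np.array([1.0, -2.0, 0.5])
est_trajs = [(g - t0) @ R0 / s0 for g in gt_trajs]   # inverse of gt = s0 * R0 @ est + t0

aligned, rmse, (s, R, t) = align_joint(est_trajs, gt_trajs)
print(f"scale={s:.3f}, rmse={rmse:.2e}")
```

Because a single transform is estimated for the stacked trajectory, any scale or pose inconsistency between cameras shows up as residual error rather than being absorbed by per-camera alignments.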