SmallGS: Gaussian Splatting-based Camera Pose Estimation for Small-Baseline Videos

Yuxin Yao, Yan Zhang, Zhening Huang, Joan Lasenby

TL;DR

Pose estimation in small-baseline dynamic videos is challenging due to weak parallax and drift. The authors propose SmallGS, which uses Gaussian splatting as an explicit 3D scene representation and optimizes camera poses in batched sliding windows anchored to the first frame of each segment, reducing reliance on depth variation. A key contribution is integrating DINOv2 visual features into the rasterization process and masking dynamic objects with semantic masks to improve robustness. Experiments on the TUM-Dynamics dataset show that SmallGS achieves lower absolute trajectory error (ATE) and relative pose error (RPE), and smoother trajectories, than state-of-the-art methods such as MonST3R and DROID-SLAM, especially when using 16-channel DINOv2 features. The approach mitigates drift in small-baseline videos and enables robust pose estimation without explicit point correspondences or strong parallax.

Abstract

Dynamic videos with small-baseline motions are ubiquitous in daily life, especially on social media. However, these videos present a challenge to existing pose estimation frameworks due to ambiguous features, drift accumulation, and insufficient triangulation constraints. Gaussian splatting, which maintains an explicit representation for scenes, provides reliable novel-view rasterization when the viewpoint change is small. Inspired by this, we propose SmallGS, a camera pose estimation framework specifically designed for small-baseline videos. SmallGS optimizes sequential camera poses using Gaussian splatting, which reconstructs the scene from the first frame in each video segment to provide a stable reference for the rest. The temporal consistency of Gaussian splatting within limited viewpoint differences reduces the need for the sufficient depth variation that traditional camera pose estimation requires. We further incorporate pretrained robust visual features, e.g., DINOv2, into Gaussian splatting, where high-dimensional feature map rendering enhances the robustness of camera pose estimation. By freezing the Gaussian splatting and optimizing camera viewpoints based on rasterized features, SmallGS effectively learns camera poses without requiring explicit feature correspondences or strong parallax motion. We verify the effectiveness of SmallGS on small-baseline TUM-Dynamics sequences, where it achieves impressive accuracy in camera pose estimation compared to MonST3R and DROID-SLAM in dynamic scenes. Our project page is at: https://yuxinyao620.github.io/SmallGS
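
The idea of freezing the scene and optimizing only the camera can be illustrated with a toy 1D analogue. The sketch below is not the SmallGS rasterizer: the scene, the scalar "features", all constants, and the finite-difference gradient descent are assumptions made purely for illustration (a real pipeline would rasterize 3D Gaussians with multi-channel features and use automatic differentiation). It only shows that a small camera translation can be recovered by comparing rendered feature images, with no explicit point correspondences.

```python
import math

# Frozen toy "scene" (hypothetical): each blob is (x position, depth z, feature weight).
SCENE = [(-1.0, 2.0, 1.0), (0.5, 3.0, 0.7), (1.2, 2.5, 0.5)]
FOCAL, CX, WIDTH, SIGMA = 20.0, 16.0, 32, 2.0


def render(tx):
    """Render a 1D feature image for a camera translated by tx along x."""
    image = [0.0] * WIDTH
    for x, z, weight in SCENE:
        center = FOCAL * (x - tx) / z + CX  # pinhole projection of the blob center
        for u in range(WIDTH):
            image[u] += weight * math.exp(-0.5 * ((u - center) / SIGMA) ** 2)
    return image


def loss(tx, target):
    """Mean squared difference between the rendered and the target feature image."""
    rendered = render(tx)
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / WIDTH


def optimize_pose(target, tx=0.0, lr=0.1, steps=200, eps=1e-4):
    """Gradient descent on the camera translation; the scene itself never changes."""
    for _ in range(steps):
        # Central finite difference stands in for automatic differentiation.
        grad = (loss(tx + eps, target) - loss(tx - eps, target)) / (2 * eps)
        tx -= lr * grad
    return tx


TRUE_TX = 0.1  # small-baseline motion: projected shifts are at most about one pixel
target = render(TRUE_TX)
estimated_tx = optimize_pose(target)
print(f"true tx = {TRUE_TX}, estimated tx = {estimated_tx:.4f}")
```

Because the viewpoint change is small relative to the blob width, the photometric-style loss is smooth around the true pose, which is the regime the abstract describes.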

Paper Structure

This paper contains 18 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Camera pose estimation for small-baseline videos with SmallGS. Our method focuses on camera pose estimation in small-baseline videos, updating the camera poses based on the rasterization of Gaussian splatting in the learned camera viewpoints. We achieved better and smoother results compared to the previous SOTA MonST3R.
  • Figure 2: Pipeline of SmallGS. Our method follows the CF-3DGS pipeline, estimating camera poses in video segments. The process is: (1) Use MonST3R to predict depth maps, confidence masks, and camera intrinsics; (2) Lift the first frame's depth map into a dense point cloud, masking dynamic objects using the confidence mask as a semantic mask; (3) Initialize and update Gaussian splatting with the first frame; (4) Freeze the Gaussian parameters and optimize batched camera poses by minimizing the error between the rasterized feature maps (under the estimated poses) and the DINOv2 feature maps, with semantic masks applied to both.
  • Figure 3: The first subfigure compares the estimated camera trajectories of MonST3R and SmallGS with 16-channel DINOv2 features. The second subfigure shows the same SmallGS trajectory compared to DROID-SLAM. The red dashed line represents the ground truth.
  • Figure 4: Comparison of ground-truth trajectories, MonST3R-predicted trajectories, and the SmallGS-learned trajectory with 16-channel DINOv2 feature maps. The trajectories predicted by MonST3R often exhibit jitter around the ground-truth trajectories. SmallGS efficiently learns the trajectories of small-baseline videos, improving camera pose accuracy.
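
Steps (2) and (4) of the Figure 2 pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the nested-list data layout, and the choice of a mean absolute error are all hypothetical, and the captions do not specify the exact loss. The first helper lifts a depth map into camera-frame 3D points with a pinhole model while skipping masked (dynamic) pixels; the second compares two feature maps over static pixels only.

```python
def unproject_depth(depth, static_mask, fx, fy, cx, cy):
    """Lift a depth map into camera-frame 3D points, skipping masked-out pixels."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if not static_mask[v][u] or z <= 0.0:
                continue  # dynamic pixel or invalid depth
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def masked_feature_error(rendered, target, static_mask):
    """Mean absolute feature difference, averaged over static pixels only."""
    total, count = 0.0, 0
    for v, row in enumerate(static_mask):
        for u, keep in enumerate(row):
            if keep:
                total += sum(abs(r - t) for r, t in zip(rendered[v][u], target[v][u]))
                count += len(rendered[v][u])
    return total / count if count else 0.0


# Tiny 2x2 example; True marks static pixels that are kept.
depth = [[2.0, 2.0], [4.0, 0.0]]
mask = [[True, False], [True, True]]
points = unproject_depth(depth, mask, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(points)  # [(-0.5, -0.5, 2.0), (-1.0, 1.0, 4.0)]

# Two-channel feature maps; the masked pixel (row 0, column 1) is ignored.
rendered = [[[1.0, 0.0], [9.0, 9.0]], [[0.5, 0.5], [0.0, 1.0]]]
target = [[[1.0, 1.0], [0.0, 0.0]], [[0.5, 0.0], [0.0, 1.0]]]
error = masked_feature_error(rendered, target, mask)
print(error)  # 0.25
```

Applying the same mask to both the rendered and the target features keeps moving objects from contributing to the pose gradient, which is the stated purpose of the semantic mask.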