
Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events

Shuang Guo, Filbert Febryanto, Lei Sun, Guillermo Gallego

Abstract

In recent years, 3D visual foundation models pioneered by pointmap-based approaches such as DUSt3R have attracted considerable interest, achieving impressive accuracy and strong generalization across diverse scenes. However, these methods are inherently limited to recovering scene geometry only at the discrete time instants when images are captured, leaving the scene evolution during the blind time between consecutive frames largely unexplored. We introduce Interp3R, which is, to the best of our knowledge, the first method that enhances pointmap-based models to estimate depth and camera poses at arbitrary time instants. Interp3R leverages asynchronous event data to interpolate pointmaps produced by frame-based models, enabling temporally continuous geometric representations. Depth and camera poses are then jointly recovered by aligning the interpolated pointmaps together with those predicted by the underlying frame-based models into a consistent spatial framework. We train Interp3R exclusively on a synthetic dataset, yet demonstrate strong generalization across a wide range of synthetic and real-world benchmarks. Extensive experiments show that Interp3R outperforms, by a considerable margin, state-of-the-art baselines that follow a two-stage pipeline of 2D video frame interpolation followed by 3D geometry estimation.

Paper Structure (39 sections, 5 equations, 6 figures, 5 tables)

This paper contains 39 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The proposed method takes two images $I_0,I_1$ and the events between them $\mathcal{E}$ as input (Left), and recovers the scene geometry (in the form of pointmaps) at an arbitrary time instant $\tau$ within the interval between the two images (Right). All available pointmaps can then be used to estimate depth maps and camera poses.
  • Figure 2: Pipeline for pointmap prediction and interpolation. The pointmap model first takes two frames as input and outputs the source pointmaps. Interp3R then interpolates the pointmap at the arbitrary target time $\tau \in (0, 1)$ from two directions, using the first part ($0 \rightarrow \tau$, "forward") and the second part ($1 \rightarrow \tau$, "backward") of the event data, respectively.
  • Figure 3: Illustration of the coarse-to-fine global alignment. The components of the coarse alignment are plotted in blue, and those of the fine alignment are in red.
  • Figure 4: Qualitative comparison of depth estimation on the Bonn dataset (skip = 3).
  • Figure 5: Qualitative results on the DAVIS dataset (skip = 3).
  • ...and 1 more figure