POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction

Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, Chunhua Shen

TL;DR

3D reconstruction in dynamic scenes suffers from ambiguous point matching and motion-induced inconsistencies. POMATO addresses this by unifying geometry estimation with explicit pointmap matching and by adding a transformer-based temporal motion module that propagates information across frames. The approach reports strong results on video depth estimation, 3D point tracking, and camera pose estimation. It runs in a feed-forward manner without external optical-flow modules, and its explicit correspondences make the predictions easier to interpret. The authors position the method as a fast, robust option for dynamic 3D reconstruction, with room for further improvement from larger dynamic datasets.

Abstract

3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules, where the latter is pivotal for distinguishing dynamic regions and thereby helps to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. The pointmap representation recently proposed in DUSt3R suggests a potential way to unify geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction that marries pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating strong performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.

Paper Structure

This paper contains 19 sections, 9 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: 3D reconstruction from an arbitrary dynamic video with POMATO. Without relying on external modules, POMATO can directly perform 3D reconstruction along with temporal 3D point tracking and dynamic mask estimation.
  • Figure 2: Ambiguity in 3D point matching in dynamic scenes with DUSt3R. Given representative corresponding pixels of background (orange) and moving foreground (red) in two different views, DUSt3R outputs a pair of 3D points within the same coordinate system. In static regions, identical pixels share the same 3D coordinates which provide an accurate matching relationship in 3D space, but in moving regions, the 3D coordinates are inconsistent for corresponding pixels across views, leading to ambiguous 3D matching relationships.
  • Figure 3: Overview of our training pipeline. (1) Stage I: building upon the DUSt3R architecture, we introduce a third regression point-matching head, Head$_3$, in parallel with Head$_2$, for explicit pointmap matching in 3D space. For each pixel in the second view, the output pointmap coordinate is the 3D point of the corresponding pixel in the first view. (2) Stage II: we introduce a temporal fusion module in all three heads that enables multi-style sequential input for learning temporal motions.
  • Figure 4: Architecture of our temporal motion module. We insert a transformer-based motion module (in light yellow) into the vanilla DPT head to enhance temporal consistency.
  • Figure 5: Inference pipelines for point tracking, video depth, and multi-view reconstruction. $t_k$ indicates the keyframe. With the help of the motion module and flexible input construction, POMATO can be easily applied to downstream temporal tasks.
  • ...and 5 more figures
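
The captions for Figures 2 and 3 describe the core idea: for each pixel in the second view, a matching head predicts the 3D point of its corresponding pixel in the first view, so correspondences can be read off in 3D space. The following is a minimal NumPy sketch of that idea on toy data, not the authors' implementation. The function names, the brute-force nearest-neighbour search, and the thresholding rule for the dynamic mask are all assumptions made for illustration; the source only states that 3D coordinates agree in static regions and disagree in moving ones.

```python
import numpy as np


def match_by_pointmap(pointmap_view1, matched_pointmap_view2):
    """For each view-2 pixel, find the view-1 pixel whose 3D point is nearest.

    pointmap_view1:         (H, W, 3) 3D point per view-1 pixel.
    matched_pointmap_view2: (H, W, 3) per view-2 pixel, the predicted 3D point
                            of its corresponding view-1 pixel (Head_3-style output).
    Returns an (H, W, 2) integer array of (row, col) indices into view 1.
    """
    h, w, _ = pointmap_view1.shape
    ref = pointmap_view1.reshape(-1, 3)          # (N, 3) reference points
    qry = matched_pointmap_view2.reshape(-1, 3)  # (N, 3) query points
    # Brute-force squared distances: O(N^2) memory, acceptable for a toy example only.
    d2 = ((qry[:, None, :] - ref[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    rows, cols = np.unravel_index(nearest, (h, w))
    return np.stack([rows, cols], axis=-1).reshape(h, w, 2)


def dynamic_mask(pointmap_view2, matched_pointmap_view2, threshold):
    """Assumed rule: a view-2 pixel is dynamic if its own 3D point and the
    3D point of its view-1 correspondence differ by more than `threshold`."""
    gap = np.linalg.norm(pointmap_view2 - matched_pointmap_view2, axis=-1)
    return gap > threshold


# Toy 2x3 scene: view-1 pixel (row, col) sits at 3D point (col, row, 1).
rows, cols = np.meshgrid(np.arange(2), np.arange(3), indexing="ij")
pointmap_view1 = np.stack([cols, rows, np.ones_like(rows)], axis=-1).astype(float)

# Hypothetical ground-truth correspondence: view 2 is view 1 flipped horizontally.
matched_pointmap_view2 = pointmap_view1[:, ::-1].copy()

# View-2 geometry in the shared frame: identical to the matched points,
# except that pixel (1, 1) moved by 0.5 along z between the two frames.
pointmap_view2 = matched_pointmap_view2.copy()
pointmap_view2[1, 1] += np.array([0.0, 0.0, 0.5])

matches = match_by_pointmap(pointmap_view1, matched_pointmap_view2)
mask = dynamic_mask(pointmap_view2, matched_pointmap_view2, threshold=0.1)
```

Here `matches[r, c]` gives the view-1 pixel matched to view-2 pixel `(r, c)`, and `mask` flags only the displaced pixel. A real pipeline would replace the quadratic search with an approximate nearest-neighbour structure and would use predicted rather than hand-built pointmaps.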