
SegMASt3R: Geometry Grounded Segment Matching

Rohit Jayanti, Swayam Agrawal, Vansh Garg, Siddharth Tourani, Muhammad Haris Khan, Sourav Garg, Madhava Krishna

TL;DR

SegMASt3R tackles wide-baseline segment matching by repurposing a 3D foundation model (MASt3R) with a segment-feature head and a differentiable Sinkhorn-based matcher, producing robust segment correspondences across image pairs with extreme viewpoint changes (up to $180^{\circ}$). The approach leverages geometry-aware priors from 3D pretraining, yielding strong segment-level representations and matching performance that surpasses SAM2’s propagator and several local feature methods on indoor and outdoor benchmarks. It introduces a differentiable segment matching layer with a learnable dustbin and an end-to-end training objective, and demonstrates practical impact on downstream tasks such as 3D instance mapping and object-relative navigation. Overall, SegMASt3R establishes segment matching as a geometry-guided, transferable capability that improves robustness to occlusion, appearance changes, and perceptual aliasing in complex scenes.
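To make the "differentiable Sinkhorn-based matcher with a learnable dustbin" concrete, here is a minimal NumPy sketch of log-domain Sinkhorn normalization over a score matrix augmented with one dustbin row and one dustbin column. It follows the commonly used SuperGlue-style formulation and is not taken from the SegMASt3R code: the function name `sinkhorn_with_dustbin`, the marginal choice, the iteration count, and the fixed `dustbin_score` (a learned scalar in a real model) are all assumptions made for illustration.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def sinkhorn_with_dustbin(scores, dustbin_score, n_iters=100):
    """Soft assignment between M and N segments with a dustbin (sketch).

    scores: (M, N) similarity matrix between segment descriptors.
    dustbin_score: scalar score for "no match" (hypothetical stand-in for
        the learnable dustbin parameter).
    Returns an (M + 1, N + 1) assignment matrix; the last row and column
    are the dustbins.
    """
    m, n = scores.shape
    # Augment the score matrix with a dustbin row and column.
    aug = np.full((m + 1, n + 1), float(dustbin_score))
    aug[:m, :n] = scores

    # Each real segment carries unit mass; each dustbin can absorb all
    # segments of the other image (assumed marginals).
    log_mu = np.concatenate([np.zeros(m), [np.log(n)]])
    log_nu = np.concatenate([np.zeros(n), [np.log(m)]])

    u = np.zeros(m + 1)
    v = np.zeros(n + 1)
    for _ in range(n_iters):
        # Alternate row and column normalization in log space.
        u = log_mu - logsumexp(aug + v[None, :], axis=1)
        v = log_nu - logsumexp(aug + u[:, None], axis=0)
    return np.exp(aug + u[:, None] + v[None, :])


# Toy example: 3 segments in image A, 2 in image B.
# The third segment of A has no good partner in B.
toy_scores = np.array([[0.0, 5.0], [5.0, 0.0], [-5.0, -5.0]])
assignment = sinkhorn_with_dustbin(toy_scores, dustbin_score=1.0)
# Column index 2 is the dustbin for segments of image A.
print(assignment[:3].argmax(axis=1))
```

Every step is a differentiable operation, which is what allows such a layer to be trained end to end; in an autodiff framework the same loop would simply be written with tensors instead of NumPy arrays.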

Abstract

Segment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. Unlike keypoint matching, which focuses on localized features, segment matching captures structured regions, offering greater robustness to occlusions, lighting variations, and viewpoint changes. In this paper, we leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching, a challenging setting involving extreme viewpoint shifts. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to 180 degrees of viewpoint change. Extensive experiments show that our approach outperforms state-of-the-art methods, including the SAM2 video propagator and local feature matching methods, by up to 30% on the AUPRC metric on the ScanNet++ and Replica datasets. We further demonstrate the benefits of the proposed model on relevant downstream tasks, including 3D instance mapping and object-relative navigation. Project Page: https://segmast3r.github.io/

Paper Structure

This paper contains 58 sections, 17 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Pipeline Overview: An image pair is processed by a frozen MASt3R backbone to extract patch-level features; segmentation masks are obtained either from a parallel segmentation module or from ground-truth annotations; the patch-level features are aggregated by the segment-feature head to form segment-level descriptors; and these descriptors are then matched across images via a differentiable optimal transport layer to produce segment-level correspondences.
  • Figure 2: Segment Matching-Guided Navigation. (left) In vanilla RoboHop's segment matching, a wall segment (orange) gets mismatched with the vanity cabinet and misguides the agent to move towards its right, leading to a navigation failure. (right) SegMASt3R correctly recognizes the same cabinet as well as other segments and guides the robot into the bathroom, and eventually to the goal. Note that the query and submap images differ between the two methods, as we manually probed the point of failure for the baseline and the nearest agent state for ours.
  • Figure 3: MapFree Outdoor Dataset - Perceptual Instance Aliasing (left): the right leg of the signboard as a query segment (red) is correctly matched by our method but mismatched with its left leg by DINOv2. Sinkhorn Matches to Dustbin (right): the query segment (red) is not visible in the reference image and is correctly ignored by our method, whereas DINOv2 mismatches it with a vehicle segment.
  • Figure 4: ScanNet++ Dataset -- Wide-baseline Matching (top): The wall (pink) and the door (blue) in the query image (left) get incorrectly associated by SAM2's video propagation (middle), whereas SegMASt3R (right) correctly matches them despite very limited visual overlap. Perceptual Instance Aliasing (bottom): unlike SAM2, SegMASt3R correctly associates the pair of monitors under the simultaneous duress of an opposing viewpoint observation and perceptual instance aliasing.
  • Figure 5: More examples comparing the proposed method against DINOv2 [oquab2023dinov2] and SAM2 [ravi2024sam] for segment matching on the ScanNet++ dataset [yeshwanthliu2023scannetpp] under wide-baseline conditions (135° to 180° viewpoint change). Both baselines tend to assign incorrect segment correspondences given an opposite viewing direction: the tables (pink and green) in (a) as well as the chairs (red and blue) in (c). Another failure mode specific to SAM2 is its inability to propagate masks under challenging viewpoint changes, as seen for the wheelchair (green) in (b). In contrast, the proposed method demonstrates accurate segment matching, attributed to its conditioning on 3D-aware priors. Note: Images have been greyed out to improve the visibility of the segments in question.
  • ...and 2 more figures
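
The aggregation step in the Figure 1 caption (patch-level features pooled into segment-level descriptors using segmentation masks) can be illustrated with a small NumPy sketch. This is a simplified stand-in, not the paper's learned segment-feature head: it uses plain mask-weighted average pooling followed by L2 normalization, and the function name `pool_segment_descriptors`, the array layouts, and the `patch_size` parameter are assumptions introduced for the example.

```python
import numpy as np


def pool_segment_descriptors(patch_feats, masks, patch_size):
    """Mask-weighted average pooling of patch features (illustrative only).

    patch_feats: (Hp, Wp, D) patch-level features from a backbone.
    masks: (S, Hp * patch_size, Wp * patch_size) binary segment masks.
    Returns (S, D) L2-normalized segment descriptors.
    """
    hp, wp, d = patch_feats.shape
    s = masks.shape[0]
    # Fraction of each patch covered by each segment: (S, Hp, Wp).
    cover = (
        masks.reshape(s, hp, patch_size, wp, patch_size)
        .astype(np.float64)
        .mean(axis=(2, 4))
    )
    weights = cover.reshape(s, hp * wp)
    feats = patch_feats.reshape(hp * wp, d).astype(np.float64)
    # Weighted mean of the patch features under each mask.
    desc = weights @ feats
    desc = desc / np.maximum(weights.sum(axis=1, keepdims=True), 1e-8)
    # Unit-normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(desc, axis=1, keepdims=True)
    return desc / np.maximum(norms, 1e-8)


# Toy example: a 2x2 grid of 2-D patch features and 4x4 pixel masks.
toy_feats = np.array([[[3.0, 4.0], [1.0, 0.0]],
                      [[0.0, 1.0], [1.0, 0.0]]])
toy_masks = np.zeros((2, 4, 4), dtype=bool)
toy_masks[0, :2, :2] = True   # segment 0 covers the top-left patch
toy_masks[1, :, 2:] = True    # segment 1 covers the right column of patches
toy_desc = pool_segment_descriptors(toy_feats, toy_masks, patch_size=2)
print(toy_desc)
```

Descriptors computed this way for the two images can be compared with a dot product to form the segment-to-segment score matrix that an optimal transport layer then normalizes into correspondences.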