V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Pani Paudel, Yuqian Fu

TL;DR

V$^{2}$-SAM is a cross-view segmentation framework that adapts SAM2 to ego-exo object correspondence. It couples geometry-aware prompts (V$^{2}$-Anchor) with appearance-aware prompts (V$^{2}$-Visual) and integrates them through a multi-expert design and a Post-hoc Cyclic Consistency Selector (PCCS). The Anchor prompts restore coordinate-based prompting across views using DINOv3 features, while the Visual prompts use a visual prompt matcher (VPMatcher) to align cross-view representations in feature and structure spaces. The model is trained jointly with visual, structural, and mask-based losses, and PCCS selects among three experts (Anchor, Visual, Fusion) based on cyclic consistency. V$^{2}$-SAM reports state-of-the-art results on Ego-Exo4D, DAVIS-2017, and HANDAL-X, including robotic-ready cross-view scenarios, which suggests that combining geometry- and appearance-guided prompting with adaptive expert selection is effective for cross-view perception.
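To make the idea of a coordinate prompt derived from cross-view features concrete, here is a minimal sketch. It is not the authors' V$^{2}$-Anchor: it assumes patch features (for example from a DINOv3-style backbone) are already available as nested lists, pools the query-view features under the object mask into a prototype, and takes the centre of the most similar target-view patch as a point prompt. The function names, the cosine-similarity matching rule, and `patch_size=16` are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def masked_mean(feats, mask):
    """Average the patch features whose mask entry is non-zero."""
    selected = [
        feats[r][c]
        for r in range(len(mask))
        for c in range(len(mask[r]))
        if mask[r][c]
    ]
    if not selected:
        raise ValueError("query mask is empty")
    dim = len(selected[0])
    return [sum(v[d] for v in selected) / len(selected) for d in range(dim)]


def anchor_point(query_feats, query_mask, target_feats, patch_size=16):
    """Return ((x, y), score) for the best-matching target patch centre.

    Assumption: a single argmax over cosine similarity stands in for
    whatever matching the real anchor generator performs.
    """
    prototype = masked_mean(query_feats, query_mask)
    best_score, best_rc = -math.inf, (0, 0)
    for r, row in enumerate(target_feats):
        for c, vec in enumerate(row):
            score = cosine(prototype, vec)
            if score > best_score:
                best_score, best_rc = score, (r, c)
    r, c = best_rc
    # Convert the patch index to pixel coordinates at the patch centre.
    return ((c + 0.5) * patch_size, (r + 0.5) * patch_size), best_score
```

The returned `(x, y)` pair is the kind of coordinate that a point-promptable segmenter could consume in the target view; how the paper selects, filters, or combines such points is not stated in this summary.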

Abstract

Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V$^{2}$-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V$^{2}$-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V$^{2}$-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V$^{2}$-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
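The selection step described above can be illustrated with a small sketch. It assumes, hypothetically, that each expert's target-view prediction has already been mapped back to the query view (the "cycle"), and that consistency is scored as the IoU between that cycled mask and the original query mask. The abstract does not specify the scoring rule, so the IoU criterion, the function names, and the dictionary layout are assumptions, not the paper's PCCS.

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                inter += 1
            if a or b:
                union += 1
    return inter / union if union else 0.0


def select_expert(query_mask, cycled_masks):
    """Pick the expert whose cycled-back mask best matches the query mask.

    cycled_masks maps an expert name to the mask obtained by predicting in
    the target view and then predicting back into the query view
    (assumed to be computed elsewhere).
    """
    scores = {name: iou(query_mask, mask) for name, mask in cycled_masks.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy example with 2x2 masks for the three experts named in the paper.
query = [[1, 1], [0, 0]]
cycled = {
    "anchor": [[1, 0], [0, 0]],
    "visual": [[1, 1], [0, 0]],
    "fusion": [[1, 1], [1, 0]],
}
best, scores = select_expert(query, cycled)
```

Because the selection is post hoc, a rule like this needs no ground truth in the target view: it only compares the query mask with what each expert recovers after a round trip.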

Paper Structure

This paper contains 27 sections, 8 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Comparison of SAM variants in segmentation capability. Our proposed V$^{2}$-SAM supports coordinate-point and visual-reference prompts for cross-view segmentation.
  • Figure 2: V$^{2}$-SAM framework. It introduces V$^{2}$-Anchor for coordinate-based cross-view prompting, V$^{2}$-Visual for enhanced appearance-guided visual matching, and a multi-prompt expert framework equipped with the PCCS module for adaptive expert selection.
  • Figure 3: The structure of the Visual Prompt Matcher. The Structural Mapping Branch is built upon a lightweight CNN-based mask encoder and decoder. The Feature Mapping Branch uses Transformer-based cross-attention layers, and Res-MLP denotes a residual multi-layer perceptron.
  • Figure 4: Comparison of Anchor, Visual, and Fusion Experts across different scenes. Left: per-scene IoU radar plot for the three experts. Right: per-scene Win% bars showing PCCS selections.
  • Figure 5: Ego2Exo qualitative results. From left to right: query view, predictions from the Anchor Expert, Visual Expert, and Fusion Expert, followed by the final output selected by the PCCS.
  • ...and 9 more figures