V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Pani Paudel, Yuqian Fu
TL;DR
V^2-SAM is a cross-view segmentation framework that adapts SAM2 to ego-exo object correspondence by coupling geometry-aware prompts (V^2-Anchor) with appearance-aware prompts (V^2-Visual), integrated through a multi-expert design and a Post-hoc Cyclic Consistency Selector (PCCS). The Anchor prompts use DINOv3 features to restore coordinate-based prompting across views, while the Visual prompts rely on a visual prompt matcher (VPMatcher) to align cross-view representations in both feature and structure spaces. The model is trained jointly with visual, structural, and mask-based losses, and it selects among three experts (Anchor, Visual, Fusion) to maximize cross-view accuracy. The approach achieves state-of-the-art results on Ego-Exo4D, DAVIS-2017, and HANDAL-X and generalizes to robotic-ready cross-view scenarios, which suggests that combining geometry- and appearance-guided prompting with adaptive expert selection is a promising direction for cross-view perception.
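To make the idea of a coordinate prompt derived from cross-view features concrete, here is a minimal sketch. It assumes a nearest-neighbour match by cosine similarity between one query feature (for example, a pooled feature of the object in the source view) and a grid of target-view patch features. The summary only states that V^2-Anchor is built on DINOv3 features, so the matching rule, the function names `cosine` and `anchor_point`, and the toy features are all illustrative assumptions, not the authors' method.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def anchor_point(query_feat: Vector, target_feats: List[List[Vector]]) -> Tuple[int, int]:
    """Return the (row, col) of the target-view patch most similar to the query.

    Hypothetical helper: in practice the features would come from a pretrained
    backbone, and the patch index would be mapped to pixel coordinates
    before being passed to the segmenter as a point prompt.
    """
    best_rc, best_sim = (0, 0), -math.inf
    for r, row in enumerate(target_feats):
        for c, feat in enumerate(row):
            sim = cosine(query_feat, feat)
            if sim > best_sim:
                best_rc, best_sim = (r, c), sim
    return best_rc


# Toy 2x2 grid of 2-D patch features (made-up values for illustration).
query_feat = [1.0, 0.0]
target_feats = [
    [[0.0, 1.0], [0.5, 0.5]],
    [[0.9, 0.1], [0.0, 1.0]],
]
point = anchor_point(query_feat, target_feats)
print(point)
```

In this toy grid the patch at row 1, column 0 points in nearly the same direction as the query feature, so it is returned as the anchor location.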
Abstract
Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). The task is challenging because of drastic viewpoint and appearance variations, which make it non-trivial to apply existing segmentation models such as SAM2 directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. The Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios. The Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, which achieves new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
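The selection step can be sketched as follows, under stated assumptions. The abstract says only that PCCS picks the most reliable expert based on cyclic consistency. This sketch assumes that each expert's target-view mask is mapped back to the source view and scored by IoU against the original query mask. The IoU score, the `back_project` callable, and all names below are hypothetical illustrations rather than the authors' implementation.

```python
from typing import Callable, Dict, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # a mask represented as the set of (row, col) foreground pixels


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_by_cyclic_consistency(
    query_mask: Mask,
    forward_preds: Dict[str, Mask],
    back_project: Callable[[Mask], Mask],
) -> Tuple[str, Dict[str, float]]:
    """Pick the expert whose target-view mask maps back closest to the query mask.

    `back_project` stands in for whatever maps a target-view mask back to the
    source view (an assumption here; a real system would run a model for this).
    """
    scores = {
        name: iou(query_mask, back_project(pred))
        for name, pred in forward_preds.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def shift_back(mask: Mask) -> Mask:
    """Toy inverse mapping: the target view is the source view shifted 5 columns."""
    return {(r, c - 5) for r, c in mask}


# Toy example with made-up masks for the three experts.
query = {(0, 0), (0, 1), (1, 0), (1, 1)}
preds = {
    "anchor": {(0, 5), (0, 6), (1, 5), (1, 6)},
    "visual": {(0, 5), (0, 6)},
    "fusion": {(0, 5), (0, 6), (1, 5)},
}
best, scores = select_by_cyclic_consistency(query, preds, shift_back)
print(best, scores)
```

Because the selector is post hoc, it needs no ground truth in the target view: it compares each round-trip result only against the query mask that is already given.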
