Towards Cross-View Point Correspondence in Vision-Language Models

Yipu Wang; Yuheng Ji; Yuyang Liu; Enshen Zhou; Ziqiang Yang; Yuxuan Tian; Ziheng Qin; Yue Liu; Huajie Tan; Cheng Chi; Zhiyuan Ma; Daniel Dajun Zeng; Xiaolong Zheng

TL;DR

This work defines Cross-View Point Correspondence (CVPC) to enable precise point-level cross-view mapping for embodied AI. It introduces CrossPoint-Bench and CrossPoint-378K to evaluate and train models on fine-grained, affordance-driven cross-view tasks, and presents CroPond as a supervised-finetuned baseline that significantly closes the gap to human performance. Key findings show that current VLMs struggle with continuous coordinate prediction and cross-view geometry, but scale and targeted spatial supervision yield strong gains, including robust generalization to related spatial benchmarks and multi-agent scenarios. The results underscore the importance of geometric consistency and affordance-focused data for enabling reliable cross-view reasoning in real-world manipulation tasks.

Abstract

Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially at the level of precise point correspondence, which is crucial for affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing the challenge of moving from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset of 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% in accuracy, and offers a foundation for future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.
