Towards Cross-View Point Correspondence in Vision-Language Models

Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, Xiaolong Zheng

TL;DR

This work defines Cross-View Point Correspondence (CVPC) to enable precise point-level cross-view mapping for embodied AI. It introduces CrossPoint-Bench and CrossPoint-378K to evaluate and train models on fine-grained, affordance-driven cross-view tasks, and presents CroPond as a supervised-finetuned baseline that significantly closes the gap to human performance. Key findings show that current VLMs struggle with continuous coordinate prediction and cross-view geometry, but scale and targeted spatial supervision yield strong gains, including robust generalization to related spatial benchmarks and multi-agent scenarios. The results underscore the importance of geometric consistency and affordance-focused data for enabling reliable cross-view reasoning in real-world manipulation tasks.

Abstract

Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially at the level of precise point correspondence, which is crucial for accurate affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing the difficulty of moving from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset of 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% in accuracy, and offers a foundation for future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.

Paper Structure

This paper contains 41 sections, 1 equation, 22 figures, 13 tables.

Figures (22)

  • Figure 1: Cross-View Point Correspondence (CVPC) builds point-level geometric correspondence across views. It involves: (a) understanding spatial instructions and locating targets (e.g., grounding the location where the chair can be grasped); (b) reasoning about target visibility (e.g., determining whether the red point is visible from slave robot 1’s view); (c) establishing point correspondence across views (e.g., mapping this point from the master view to slave robot 1’s view). This CVPC capability is pivotal for embodied robots to comprehend spatial layouts and execute downstream tasks (e.g., multi-agent chair organization).
  • Figure 2: Overview of CrossPoint-Bench, which is divided into four categories, each covering two levels of affordance.
  • Figure 3: Overview of the CrossPoint-378K. CrossPoint-378K is a comprehensive dataset for Cross-View Point Correspondence, synthesized via a specially designed automated generation pipeline. S.U., S.G., and C.V. in (a) denote Single-View Spatial Understanding, Single-View Fine-Grained Grounding, and Cross-View Visibility Reasoning, respectively.
  • Figure 4: CrossPoint-Bench results. Qwen, G.M., and M.M. denote Qwen3-VL-235B-A22B-Instruct, Gemini-2.5-Pro, and Molmo-7B-D, respectively. CroPond-7B excels across the four cases.
  • Figure 5: Error distribution for different models. Spatial reconstruction failure dominates across all models, while frame transfer errors are mainly in open-source models.
  • ...and 17 more figures