Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation

Hengyi Wang; Ruiqiang Zhang; Chang Liu; Guanjie Wang; Zehua Ma; Han Fang; Weiming Zhang

TL;DR

The paper addresses a weakness of Vision-Language Models (VLMs) in allocentric spatial reasoning, which it attributes to strong egocentric visual priors. It introduces Allocentric Perceiver, a training-free pipeline that lifts 2D inputs into a global metric space $\mathcal{W}$, instantiates a query-aligned allocentric frame $\mathcal{F}_{allo}$, and grounds reasoning in symbolic geometry prompts. This three-stage approach decouples perspective-taking from implicit visual priors and yields an average gain of about 10% on allocentric tasks across multiple backbones while maintaining egocentric performance. Results on ViewSpatial-Bench and 3DSRBench show improvements across backbones, supporting the method's practicality and portability and its potential as a guide for geometry-aware spatial reasoning in real-world embodied AI systems.
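As a rough illustration of what "instantiating" $\mathcal{F}_{allo}$ inside $\mathcal{W}$ can mean, the following is the standard rigid change of coordinates, not necessarily the paper's exact formulation. The symbols $\mathbf{o}$, $\mathbf{r}$, $\mathbf{u}$, $\mathbf{f}$, and $R$ are assumptions introduced here for illustration. Suppose the frame is anchored at an origin $\mathbf{o} \in \mathbb{R}^3$ (for example, the reference object's position in $\mathcal{W}$) with orthonormal right, up, and forward axes $\mathbf{r}, \mathbf{u}, \mathbf{f}$ expressed in $\mathcal{W}$. A point $\mathbf{p}_{\mathcal{W}}$ then has allocentric coordinates

$$
\mathbf{p}_{allo} = R^{\top}\left(\mathbf{p}_{\mathcal{W}} - \mathbf{o}\right), \qquad R = \begin{bmatrix} \mathbf{r} & \mathbf{u} & \mathbf{f} \end{bmatrix} \in \mathbb{R}^{3 \times 3}, \qquad R^{\top} R = I .
$$

Under this convention, the sign of the first component of $\mathbf{p}_{allo}$ distinguishes right from left and the sign of the third distinguishes front from behind, both relative to the reference object rather than the camera.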

Abstract

With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing attention. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. To address this, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
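The idea of replacing mental rotation with explicit computation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`instantiate_frame`, `to_allocentric`, `describe`), the y-up world convention, the use of a facing direction to define the frame, and the wording of the generated text are all hypothetical. In the actual method, object positions and orientations would come from the off-the-shelf geometric experts.

```python
import numpy as np


def instantiate_frame(origin, facing, world_up=(0.0, 1.0, 0.0)):
    """Build a hypothetical allocentric frame anchored at a reference object.

    Returns the origin and a 3x3 matrix whose columns are the frame's
    right, up, and forward axes expressed in world coordinates.
    """
    up = np.asarray(world_up, dtype=float)
    up = up / np.linalg.norm(up)
    forward = np.asarray(facing, dtype=float)
    # Assumption: only the horizontal heading of the reference object matters.
    forward = forward - np.dot(forward, up) * up
    norm = np.linalg.norm(forward)
    if norm < 1e-8:
        raise ValueError("facing direction is parallel to the up axis")
    forward = forward / norm
    right = np.cross(forward, up)
    axes = np.stack([right, up, forward], axis=1)
    return np.asarray(origin, dtype=float), axes


def to_allocentric(p_world, origin, axes):
    """Express a world-space point in the allocentric frame."""
    return axes.T @ (np.asarray(p_world, dtype=float) - origin)


def describe(name, p_allo, eps=1e-6):
    """Turn allocentric coordinates into a short symbolic statement (metres)."""
    x, _, z = p_allo
    parts = []
    if abs(x) > eps:
        parts.append(f"{abs(x):.2f} m to the {'right' if x > 0 else 'left'}")
    if abs(z) > eps:
        parts.append(f"{abs(z):.2f} m {'in front' if z > 0 else 'behind'}")
    return f"{name}: " + (", ".join(parts) if parts else "at the reference origin")


if __name__ == "__main__":
    lamp = (3.0, 0.0, -1.0)
    chair = (2.0, 0.0, 1.0)
    # Same scene, two possible chair orientations: the answer flips with the frame.
    for facing in [(0.0, 0.0, -1.0), (0.0, 0.0, 1.0)]:
        origin, axes = instantiate_frame(chair, facing)
        print(facing, describe("lamp", to_allocentric(lamp, origin, axes)))
```

In this toy scene the lamp comes out to the right and in front of the chair when the chair faces world $-z$, and to the left and behind it when the chair faces $+z$, even though the image itself would be identical. A statement like the one returned by `describe` is the kind of geometry-grounded text that could be placed in the prompt in place of asking the VLM to rotate the scene implicitly.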

Paper Structure (29 sections, 5 equations, 8 figures, 4 tables, 2 algorithms)

Introduction
Related Works
Training VLMs for Spatial Reasoning
Training-Free Spatial Reasoning via Prompting and Tool Use
3D-Aware Multimodal Large Language Models
Method
Metric-Aware Egocentric Perception
Dynamic Frame Instantiation
Symbolic Geometry Reasoning
Experiments
Experimental Settings
Enhancement Across VLMs
Comparison With More VLMs
Discussions
Whether Image inputting
...and 14 more sections

Figures (8)

Figure 1: Egocentric vs. allocentric instructions.
Figure 2: ViewSpatial-Bench comprises two egocentric tasks and three allocentric tasks. "Stander input" denotes standard multimodal input, while "Text only" refers to text-only question inputs.
Figure 3: Illustration of Visual-Semantic Ambiguity. The validity of spatial descriptions is contingent upon the reference frame.
Figure 4: Framework of Alloceiver. To bridge the Reference Frame Gap, our framework explicitly decouples spatial reasoning from egocentric visual priors. The pipeline operates in three stages: (1) Metric-Aware Perception lifts 2D visual observations into a unified 3D metric world space ($\mathcal{W}$); (2) Dynamic Frame Instantiation constructs a query-aligned allocentric reference frame ($\mathcal{F}_{allo}$) via explicit coordinate transformation; and (3) Symbolic Geometry Reasoning derives the final answer through geometry-grounded logical deduction.
Figure 5: Comparison of Alloceiver with other VLMs on typical allocentric questions. Although these VLMs (state-of-the-art closed-source commercial models and trained models) attempt to shift perspective, they remain affected by visual bias and the reference frame gap, and therefore fail.
...and 3 more figures
