Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation

Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang

TL;DR

The paper addresses the gap in Vision-Language Models' allocentric spatial reasoning caused by strong egocentric visual priors. It introduces Allocentric Perceiver, a training-free pipeline that lifts 2D inputs into a global metric space $\mathcal{W}$, instantiates a query-aligned allocentric frame $\mathcal{F}_{allo}$, and grounds reasoning in symbolic geometry prompts. This three-stage approach decouples perspective-taking from implicit visual priors, enabling reliable allocentric inference across multiple backbones with about a 10% average gain on allocentric tasks while maintaining egocentric performance. Empirical results on ViewSpatial-Bench and 3DSRBench demonstrate cross-backbone improvements, underscoring the method's practicality, portability, and potential as guidance for geometry-aware spatial reasoning in real-world embodied AI systems.

Abstract

With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
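
The deterministic change of reference frame described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`to_allocentric`, `describe`), the z-up world convention, and the hard-coded object positions are all assumptions made for illustration. In the paper, the 3D state would come from off-the-shelf geometric experts rather than from constants.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(v: Vec3) -> Vec3:
    n = math.sqrt(_dot(v, v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return (v[0] / n, v[1] / n, v[2] / n)


def to_allocentric(point_w: Vec3, origin_w: Vec3, facing_w: Vec3,
                   up_w: Vec3 = (0.0, 0.0, 1.0)) -> Vec3:
    """Express a world-space point as (right, forward, up) offsets in a frame
    anchored at a reference object.

    Assumption: the world frame is right-handed with `up_w` pointing up.
    """
    up = _normalize(up_w)
    # Remove any vertical component so "forward" lies in the ground plane.
    d = _dot(facing_w, up)
    forward = _normalize((facing_w[0] - d * up[0],
                          facing_w[1] - d * up[1],
                          facing_w[2] - d * up[2]))
    right = _cross(forward, up)
    rel = _sub(point_w, origin_w)
    return (_dot(rel, right), _dot(rel, forward), _dot(rel, up))


def describe(coords: Vec3, eps: float = 1e-6) -> str:
    """Turn allocentric offsets into a coarse symbolic relation."""
    right, forward, _ = coords
    depth = "front" if forward > eps else "behind" if forward < -eps else "level"
    side = "right" if right > eps else "left" if right < -eps else "center"
    return f"{depth}-{side}"


# Hypothetical scene: the camera sits at the origin looking along +x.
# A person stands at (2, 0, 0) facing the camera, i.e. along -x.
# A cup sits at (1, 1, 0.5), which is on the left side of the image.
person_pos: Vec3 = (2.0, 0.0, 0.0)
person_facing: Vec3 = (-1.0, 0.0, 0.0)
cup_pos: Vec3 = (1.0, 1.0, 0.5)

cup_in_person_frame = to_allocentric(cup_pos, person_pos, person_facing)
relation = describe(cup_in_person_frame)
```

In this toy scene the cup appears on the camera's left but lies on the person's right, which is the kind of egocentric/allocentric mismatch the method targets. A symbolic string such as `relation`, together with the metric offsets, is the sort of structured fact that could be placed in the prompt so the backbone VLM does not have to perform the rotation implicitly.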

Paper Structure

This paper contains 29 sections, 5 equations, 8 figures, 4 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Egocentric vs. allocentric instructions.
  • Figure 2: ViewSpatial-Bench comprises two egocentric tasks and three allocentric tasks. "Stander input" denotes standard multimodal input, while "Text only" refers to text-only question inputs.
  • Figure 3: Illustration of Visual-Semantic Ambiguity. The validity of spatial descriptions is contingent upon the reference frame.
  • Figure 4: Framework of Alloceiver. To bridge the Reference Frame Gap, our framework explicitly decouples spatial reasoning from egocentric visual priors. The pipeline operates in three stages: (1) Metric-Aware Perception lifts 2D visual observations into a unified 3D metric world space ($\mathcal{W}$); (2) Dynamic Frame Instantiation constructs a query-aligned allocentric reference frame ($\mathcal{F}_{allo}$) via explicit coordinate transformation; and (3) Symbolic Geometry Reasoning derives the final answer through geometry-grounded logical deduction.
  • Figure 5: Comparison of Alloceiver's performance with other VLMs on typical allocentric questions. Although these VLMs (state-of-the-art commercial closed-source models and fine-tuned models) attempt to shift perspective, they are still affected by visual bias and the reference frame gap, leading to failure.
  • ...and 3 more figures