Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models

Jaeyun Jang, Seunghui Shin, Taeho Park, Hyoseok Hwang

TL;DR

SymPL introduces Symbolic Projective Layout to recast allocentric spatial reasoning into symbolic-layout questions using four factors—Projection, Abstraction, Bipartition, and Localization. The framework operates in two stages: Spatial Information Extraction to build a 3D scene representation, and Question Reformulation to generate a symbolic layout that aligns with VLM strengths. Across five datasets and multiple VLM baselines, SymPL yields substantial gains in allocentric reasoning and enhances egocentric performance and robustness to visual illusions and viewpoint changes, with ablations confirming the contribution of each factor. This principled reformulation enables robust, perspective-aware spatial reasoning in vision–language systems without extensive retraining or data collection, broadening applicability to real-world tasks requiring multi-view understanding.

Abstract

Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints, either egocentric (observer-centered) or allocentric (object-centered). While vision-language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene. In this study, we address this underexplored challenge by introducing Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well. By leveraging four key factors (projection, abstraction, bipartition, and localization), SymPL converts allocentric questions into structured symbolic-layout representations. Extensive experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks and enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to these gains. These results show that SymPL provides an effective and principled approach for addressing complex perspective-aware spatial reasoning.
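
To make the egocentric/allocentric distinction concrete, the following minimal sketch contrasts the two readings of "left or right" on a top-down 2D plane. It is an illustration only, not the SymPL pipeline: the function names, the coordinate convention (x to the viewer's right, y away from the viewer), and the use of a 2D cross product are all assumptions introduced here.

```python
# Illustrative sketch only; names and conventions are assumptions,
# not taken from the paper.
# Convention: top-down plane, +x is the viewer's right, +y points away
# from the viewer.

def egocentric_side(ref_pos, target_pos):
    """Side of the target relative to the reference, as the viewer sees it."""
    return "right" if target_pos[0] - ref_pos[0] > 0 else "left"


def allocentric_side(ref_pos, ref_facing, target_pos):
    """Side of the target from the reference object's own point of view.

    ref_facing is the direction the reference object faces on the plane.
    The sign of the 2D cross product (facing x offset) tells whether the
    target lies counter-clockwise (left) or clockwise (right) of facing.
    """
    dx = target_pos[0] - ref_pos[0]
    dy = target_pos[1] - ref_pos[1]
    cross = ref_facing[0] * dy - ref_facing[1] * dx
    return "left" if cross > 0 else "right"


# A person at the origin faces the viewer (direction -y); a cup sits at x = 1.
person, facing, cup = (0.0, 0.0), (0.0, -1.0), (1.0, 0.0)
print(egocentric_side(person, cup))           # viewer's reading
print(allocentric_side(person, facing, cup))  # person's reading
```

In this example the two readings disagree: the cup is on the viewer's right but on the person's left, which is the kind of perspective shift an allocentric question requires.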

Paper Structure (21 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: SymPL reformulates allocentric questions into symbolic-layout questions using four factors (projection, abstraction, bipartition, and localization), enabling significantly improved spatial reasoning under allocentric settings.
  • Figure 2: Overview of SymPL framework. SymPL reformulates an allocentric question into a symbolic-layout question through two stages: 1) Spatial Information Extraction and 2) Question Reformulation using four key factors: projection, abstraction, bipartition, and localization.
  • Figure 3: Partition rule based on spatial reasoning category. Directional comparisons adopt a linear partition, while distance comparisons employ a circular one.
  • Figure 4: Allocentric spatial reasoning examples. Qwen2.5-VL + SoM and APC-Vis exhibited limited allocentric spatial reasoning performance across various categories. In contrast, our SymPL effectively handled allocentric questions by reformulating them into symbolic-layout questions.
  • Figure 5: Ablation results of each key factor. (a) projection, (b) abstraction, (c) bipartition, (d) localization. The darker bar indicates the configuration used in SymPL.
  • ...and 1 more figures
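
The partition rule named in the Figure 3 caption (linear for directional comparisons, circular for distance comparisons) can be sketched as follows. The caption gives only the two partition shapes, so everything else here is an assumption made for illustration: the function names, the 2D top-down coordinates, the use of a dot product for the linear split, and the choice of circle radius.

```python
# Hypothetical sketch of a linear vs. circular bipartition on a 2D layout.
# Only the two partition shapes come from the Figure 3 caption; the rest
# (names, coordinates, radius choice) is assumed for illustration.
import math


def linear_partition(ref, axis, point):
    """Split the plane by the line through ref that is perpendicular to axis.

    Returns +1 if the point lies on the side axis points toward, else -1.
    """
    dx, dy = point[0] - ref[0], point[1] - ref[1]
    return 1 if dx * axis[0] + dy * axis[1] >= 0 else -1


def circular_partition(center, radius, point):
    """Split the plane by a circle around center; report the point's region."""
    return "inside" if math.dist(center, point) <= radius else "outside"


ref = (0.0, 0.0)

# Directional comparison: is the point in front of ref (axis = +y) or behind?
print(linear_partition(ref, (0.0, 1.0), (2.0, 3.0)))

# Distance comparison: is the point within 5 units of ref?
print(circular_partition(ref, 5.0, (3.0, 3.0)))
```

Either way, the question reduces to a binary choice between two regions of the layout, which is the intent behind the bipartition factor as described in the summary above.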