STAR: Mitigating Cascading Errors in Spatial Reasoning via Turn-point Alignment and Segment-level DPO

Pukun Zhao, Longxiang Wang, Chen Chen, Peicheng Wang, Fanqing Zhou, Runze Li, Haojian Huang

Abstract

Structured spatial navigation is a core benchmark for the spatial reasoning of Large Language Models (LLMs). Existing paradigms such as Visualization-of-Thought (VoT) are prone to cascading errors in complex topologies. To address this, we propose STAR, a two-stage framework grounded in topological anchors, and introduce the RedMaze-23K dataset with human-inspired turn-point annotations. The first stage uses supervised fine-tuning to help models internalize spatial semantics and prune redundant paths. The second stage adopts Spatial-aware Segment-level Direct Preference Optimization (SDPO) to refine self-correction in long-horizon navigation. Experiments show that STAR achieves state-of-the-art performance among open-source models: its 32B variant outperforms DeepSeek-V3 (29.27% vs. 25.00%) and reaches 82.4% of GPT-4's performance.
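
For orientation, the standard DPO objective that the second stage builds on can be sketched as below. This is only the generic pairwise DPO loss; the abstract does not say how SDPO defines or weights segments, so restricting the log-probabilities to a token span (the hypothetical `segment_logp` helper) is an assumption made purely for illustration, as are all function and parameter names.

```python
import math
from typing import Sequence


def segment_logp(token_logps: Sequence[float], start: int, end: int) -> float:
    """Sum per-token log-probabilities over one segment [start, end).

    Assumption: a "segment" is a contiguous token span; the paper's actual
    segmentation rule is not given in the abstract.
    """
    return sum(token_logps[start:end])


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Generic DPO loss for one preference pair.

    Inputs are (segment-level) log-probabilities of the chosen and rejected
    responses under the policy and under the frozen reference model.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen segment's likelihood relative to the rejected one.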

Paper Structure

This paper contains 23 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Strategic Navigation Heuristics: Traditional Sequential Exploration vs. Human-inspired Turn-point Anchoring
  • Figure 2: Framework of STAR. STAR adopts a two-stage paradigm: Stage 1 utilizes Supervised Fine-Tuning (SFT) on RedMaze-23K to ground static spatial understanding, while Stage 2 employs Spatial-aware Direct Preference Optimization (SDPO) to refine dynamic decision-making via segment-level alignment. This progression systematically enhances accuracy and robustness by mitigating reasoning drift at critical topological anchors.
  • Figure 3: Attention Score Comparison (Base vs. Ours). Visualizing the attention scores shows that our method focuses more accurately on critical turn-points than the baseline does.
  • Figure 4: Confidence Score Comparison (Base vs. Ours). Confidence scores of the base model and the STAR-enhanced model on our dataset when selecting the positive sample (A) over each negative sample (B, C, D, E). The STAR-enhanced model consistently achieves higher scores across all four positive-negative pairs.
  • Figure 5: Qualitative Comparison of Reasoning Paradigms in Structured Spatial Navigation. We illustrate how decision-making logic diverges among three paradigms: (1) CoT (left) relies on purely textual decomposition but frequently fails to ground spatial relationships at complex junctions, leading to invalid moves (e.g., colliding with obstacles). (2) VoT (middle) introduces step-wise textual analysis to find the shortest path; however, it lacks explicit visual anchoring, making it vulnerable to cascading errors during long-horizon reasoning. (3) Ours (STAR) (right) implements Topological Anchoring, where the model is required to confirm its current position by re-visualizing the traversed path and marking its status with a visual icon at each critical turn-point. This structural feedback loop ensures that the model internalizes the maze's topological constraints, effectively mitigating reasoning drift and ensuring policy consistency from origin to destination.
  • ...and 1 more figure
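
The turn-point idea in the captions above can be illustrated with a small sketch. It assumes a turn-point is a cell on a grid path where the movement direction changes, and that a path can be cut into straight segments at those cells; the paper's actual annotation procedure for RedMaze-23K is not described here, so the function names and the (row, column) path representation are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column) on the maze grid; an assumed encoding


def turn_indices(path: List[Cell]) -> List[int]:
    """Return indices of interior cells where the step direction changes."""
    indices = []
    for i in range(1, len(path) - 1):
        d_in = (path[i][0] - path[i - 1][0], path[i][1] - path[i - 1][1])
        d_out = (path[i + 1][0] - path[i][0], path[i + 1][1] - path[i][1])
        if d_in != d_out:
            indices.append(i)
    return indices


def split_into_segments(path: List[Cell]) -> List[List[Cell]]:
    """Cut a path into straight segments that share their turn-point cells."""
    if len(path) < 2:
        return [list(path)]
    cuts = [0] + turn_indices(path) + [len(path) - 1]
    return [path[a:b + 1] for a, b in zip(cuts, cuts[1:])]


# Example: right, right, down, down, right.
example_path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3)]
example_turns = [example_path[i] for i in turn_indices(example_path)]
example_segments = split_into_segments(example_path)
```

For the example path, the turn-points are (0, 2) and (2, 2), giving three straight segments. Segments of this kind are one plausible unit at which position could be re-checked or preferences compared.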