AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

Alan Dao, Dinh Bach Vu

TL;DR

AlphaMaze tackles the challenge of endowing standard LLMs with visual spatial reasoning for maze navigation. It introduces a two-stage training pipeline: Supervised Fine-Tuning (SFT) on tokenized maze representations to learn step-by-step actions, followed by Group Relative Policy Optimization (GRPO) with a tailored reward design to refine reasoning and promote emergent chain-of-thought (CoT) behavior. On MazeBench, the SFT stage achieves 86% accuracy, and GRPO boosts this to 93% after 1600 steps; qualitative analysis reveals self-corrective reasoning and CoT-like patterns. The work demonstrates a path to bridging language models and visual-spatial tasks, with potential applications in robotics and autonomous navigation, and suggests avenues for extending the approach to broader cognitive and planning domains.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine-Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeek-R1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
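To make the "group relative" part of GRPO concrete, the sketch below shows the standard normalization step: several completions are sampled for the same prompt, each is scored, and every reward is standardized against the mean and standard deviation of its own group. This is a minimal illustration, not the authors' implementation. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made here; the paper's actual reward function is not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation; some implementations use the
    sample standard deviation instead. `eps` avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled solutions to one maze prompt
# (1.0 = reached the target, 0.0 = did not). Values are illustrative only.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion scores the same carries no learning signal.
flat_advantages = group_relative_advantages([0.5, 0.5, 0.5])
```

Completions that score above their group's mean receive a positive advantage and those below receive a negative one, so the policy update needs no separate learned value model.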

Paper Structure

This paper contains 26 sections, 3 figures, 2 tables, 3 algorithms.

Figures (3)

  • Figure 1: MazeBench scores over GRPO steps with a linear regression trendline and its $\pm1$ standard deviation bounds.
  • Figure 2: Visualization of the example maze.
  • Figure 3: Visualization of AlphaMaze's step-by-step reasoning process while solving a maze.