Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu

TL;DR

Ariadne is a controllable RLVR framework for probing and extending the spatial reasoning of Vision-Language Models, built on synthetic mazes with tunable difficulty. Training yields substantial gains, including over 50% accuracy on tasks where the base model scored 0%, but the extension is partial: generalization limits depend on the difficulty dimension (e.g., path length versus turns), and behavior on real-world tasks diverges. Despite training only on synthetic mazes, the method shows zero-shot improvements on the real-world benchmarks MapBench and ReasonMap, indicating practical transfer for spatial reasoning. The work highlights both the potential and the limits of capability-extending alignment, and calls for aligning pretraining and evaluation environments to better capture real-world complexity.

Abstract

While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model's initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model's fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.
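
The two difficulty controls named above (path length and turns) and the idea of a reward that can be checked mechanically can be illustrated with a minimal sketch. The grid representation, the move vocabulary (`up`, `down`, `left`, `right`), the binary reward, and the function names `count_turns` and `verified_reward` are all assumptions made for illustration; the abstract does not specify the maze encoding or the exact reward used in Ariadne.

```python
from typing import Sequence, Set, Tuple

Cell = Tuple[int, int]  # (row, col)

# Hypothetical move vocabulary; the paper's actual action format may differ.
DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def count_turns(moves: Sequence[str]) -> int:
    """Number of direction changes between consecutive moves."""
    return sum(1 for a, b in zip(moves, moves[1:]) if a != b)


def verified_reward(open_cells: Set[Cell], start: Cell, goal: Cell,
                    moves: Sequence[str]) -> float:
    """Return 1.0 if the moves walk from start to goal through open cells, else 0.0.

    Assumed binary reward: any invalid action, wall hit, or wrong endpoint scores 0.0.
    """
    pos = start
    for move in moves:
        if move not in DELTAS:
            return 0.0  # unparseable action
        dr, dc = DELTAS[move]
        pos = (pos[0] + dr, pos[1] + dc)
        if pos not in open_cells:
            return 0.0  # stepped into a wall or off the grid
    return 1.0 if pos == goal else 0.0


# Toy maze: an L-shaped corridor with 3 steps and 2 turns.
maze = {(0, 0), (0, 1), (1, 1), (1, 2)}
path = ["right", "down", "right"]
reward = verified_reward(maze, (0, 0), (1, 2), path)
difficulty = (len(path), count_turns(path))  # (steps, turns)
```

Because the check is a deterministic walk over the grid, the reward needs no learned judge, and the `(steps, turns)` pair gives the kind of handle a difficulty-aware curriculum could sort samples by.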

Paper Structure

This paper contains 19 sections, 5 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: (A) Step-length distribution in the AlphaMaze [dao2025alphamaze] training set, where the number of moves $s \in \{1,2,3,4,5\}$ is sampled according to an inverted Gaussian-like distribution centered at $s=3$, ensuring higher frequencies for both simple and complex cases. (B) Distribution of directional turns in the training set under the same controlled sampling scheme. (C) Step-length distribution in the AlphaMaze [dao2025alphamaze] testing set, constructed via uniform sampling. (D) Distribution of directional turns in the testing set under the same controlled sampling scheme.
  • Figure 2: Illustrative examples from our two controlled benchmarks for path-finding evaluation. MapBench (left) features human-readable, outdoor navigation tasks derived from challenging real-world scenarios (e.g., mall navigation), designed to assess naturalistic instruction-following and local decision-making. ReasonMap (right) uses high-resolution transit maps from global metropolitan systems (e.g., Los Angeles, Toronto, Beijing), with a two-tier evaluation (short vs. long questions) to probe both fine-grained visual comprehension and global route planning.
  • Figure 3: Training reward dynamics and evaluation of path-following ability. (A) Reward curve during GRPO training, showing steady improvement in rewards and stable learning progress. (B, D) For Qwen2.5-VL-7B-Instruct, the success rate rapidly collapses to zero once cases reach 3 movement steps or 3 turns, while token length increases, suggesting that the VLM generates longer but unsuccessful trajectories. (C, E) In contrast, our Ariadne framework demonstrates a remarkable improvement in success rates at the base VLM boundary, raising performance from 0% to 50% on 3-step cases and from 0% to over 10% on 3-turn cases. Token length grows moderately, and the collapse point shifts from 3 to 5, reflecting that after RLVR training, the VLM succeeds on tasks where the base model consistently failed, indicating an extended reasoning boundary.
  • Figure 4: Representative success (top row) and failure (bottom row) cases from the AlphaMaze test set under controlled step sizes (4, 6, 8). Success cases generally correspond to smoother layouts with limited detour requirements, enabling coherent long-range navigation. Failure cases, by contrast, arise in locally complex structures characterized by dense turns, narrow passages, and elongated detours, which challenge the model’s ability to maintain global path consistency.
  • Figure 5: Navigation results on MapBench [xing2025can]. Left: Trail task with an unstructured outdoor layout. Right: Museum task with a structured indoor layout. The baseline Qwen2.5-VL-7B-Instruct model [bai2025qwen25vltechnicalreport] (red) produces incomplete or less feasible trajectories due to issues such as target misidentification, inefficient detours, or violations of environmental constraints. In comparison, Ariadne (green) demonstrates more consistent and goal-directed planning while adhering to structural boundaries.
  • ...and 2 more figures
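
The training-set sampling described in the Figure 1 caption (step lengths $s \in \{1,2,3,4,5\}$ drawn from an inverted Gaussian-like distribution centered at $s=3$) could be sketched as follows. The exact functional form, the width `sigma`, and the `floor` term that keeps $s=3$ from having zero probability are assumptions chosen for illustration; the caption only states the shape and the center.

```python
import math
import random
from typing import Dict, List


def inverted_gaussian_weights(steps: List[int], center: float = 3.0,
                              sigma: float = 1.5, floor: float = 0.2) -> Dict[int, float]:
    """Normalized weights that are lowest at `center` and grow toward both ends.

    Assumed form: floor + 1 - exp(-(s - center)^2 / (2 * sigma^2)), then normalized.
    """
    raw = {s: floor + 1.0 - math.exp(-((s - center) ** 2) / (2.0 * sigma ** 2))
           for s in steps}
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}


def sample_step_lengths(n: int, seed: int = 0) -> List[int]:
    """Draw n step lengths from {1,...,5} using the inverted Gaussian-like weights."""
    steps = [1, 2, 3, 4, 5]
    weights = inverted_gaussian_weights(steps)
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.choices(steps, weights=[weights[s] for s in steps], k=n)


weights = inverted_gaussian_weights([1, 2, 3, 4, 5])
samples = sample_step_lengths(1000)
```

With these assumed parameters the weights are symmetric around $s=3$ and largest at $s=1$ and $s=5$, matching the caption's "higher frequencies for both simple and complex cases". The uniform test-set sampling in panel (C) would correspond to passing equal weights instead.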