SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation

Gio Huh, Dhruv Sheth, Rayhan Zirvi, Frank Xiao

TL;DR

Vision-Language Models struggle with complex spatial reasoning due to reliance on textual priors. SpatialTraceGen distills a large model's reasoning into verifier-vetted, multi-tool reasoning traces, reformulated as offline RL data, to enable efficient fine-tuning of smaller models. The framework contributes a data-generation pipeline, a verifier-driven quality-improvement step, and a dataset suitable for supervised fine-tuning and offline RL, with a reported 17% gain in average trace quality score and a >40% reduction in quality variance on CLEVR-Humans. Although final-answer accuracy on a simple benchmark is unchanged (74%), the approach strengthens the reasoning signal available for downstream training. This work offers a practical path to sample-efficient learning of spatial reasoning through high-quality demonstrations and tool-based cognition.

Abstract

While Vision-Language Models (VLMs) excel in many areas, they struggle with complex spatial reasoning, which requires problem decomposition and strategic tool use. Fine-tuning smaller, more deployable models offers an efficient path to strong performance, but this is hampered by a major bottleneck: the absence of high-quality, step-by-step reasoning data. To address this data-efficiency gap, we introduce SpatialTraceGen, a framework to distill the reasoning processes of a large teacher model into a high-quality dataset of multi-hop, multi-tool reasoning traces. A key innovation is our automated Verifier, which scalably ensures the fidelity of each reasoning step, providing a cost-effective alternative to manual human annotation. On the CLEVR-Humans benchmark, this verifier-guided process improves the average quality score of traces by 17% while reducing quality variance by over 40%. SpatialTraceGen delivers a dataset of expert traces, providing the structured, step-by-step examples of tool use necessary for effective fine-tuning and sample-efficient offline reinforcement learning.
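
To make the verifier-guided process concrete, the following is a minimal sketch of how step-level gating with a minimum acceptance threshold $\tau$ might look. It is not the authors' implementation. The names `Step`, `Trace`, `generate_trace`, `propose_step`, and `score_step` are hypothetical, and the retry-then-discard policy is an assumption made only for illustration; the summary above says only that a Verifier checks each reasoning step before a trace is kept.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    thought: str       # the generator's reasoning for this hop
    tool: str          # name of the vision tool it chose to call
    observation: str   # what the tool returned


@dataclass
class Trace:
    question: str
    steps: List[Step] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)


def generate_trace(
    question: str,
    propose_step: Callable[[str, List[Step]], Optional[Step]],
    score_step: Callable[[str, List[Step], Step], float],
    tau: float,
    max_steps: int = 8,
    max_retries: int = 2,
) -> Optional[Trace]:
    """Build a trace one hop at a time, keeping only verifier-approved steps.

    propose_step stands in for the teacher model (returns None when done);
    score_step stands in for the verifier. Both are hypothetical callables.
    """
    trace = Trace(question)
    for _ in range(max_steps):
        accepted = False
        for _ in range(max_retries + 1):
            step = propose_step(question, trace.steps)
            if step is None:
                # The generator signals that the reasoning is complete.
                return trace
            score = score_step(question, trace.steps, step)
            if score >= tau:
                trace.steps.append(step)
                trace.scores.append(score)
                accepted = True
                break
        if not accepted:
            # Assumed policy: drop the whole trace if a hop never passes.
            return None
    # Step budget exhausted; return what was accepted so far.
    return trace
```

Under this sketch, raising `tau` trades corpus size for per-step quality: more proposals are rejected, so fewer traces survive, but every surviving step has a verifier score of at least `tau`.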

Paper Structure

This paper contains 22 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: SpatialTraceGen pipeline. The Single Hop Generator (yellow) breaks spatial queries into steps, invokes vision tools (blue), and records traces. A Verifier LLM validates each step before inclusion in the SpatialTrace corpus (red). Green boxes show API/data flow.
  • Figure 2: Tool visualizations using CLEVR-Humans sample images. Our framework integrates diverse vision tools to provide rich, multi-modal information for spatial reasoning.
  • Figure 3: Impact of verification threshold $\tau$ on reasoning quality scores.
  • Figure 4: Distribution of tool calls over different minimum acceptance thresholds ($\tau$).
  • Figure 5: Example trace for $\tau = 4.0$. Reasoning performed on question 22 (basic verification).
  • ...and 1 more figure