SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation

Gio Huh; Dhruv Sheth; Rayhan Zirvi; Frank Xiao

TL;DR

Vision-Language Models struggle with complex spatial reasoning because they rely on textual priors. SpatialTraceGen distills the reasoning of a large teacher model into verifier-vetted, multi-tool reasoning traces, reformulated as offline RL data, so that smaller models can be fine-tuned efficiently. The framework contributes a data-generation pipeline, a verifier that improves trace quality, and a dataset suitable for supervised fine-tuning and offline RL. On CLEVR-Humans, the verifier-guided process yields a reported 17% gain in average trace quality and a reduction of more than 40% in quality variance. Final-answer accuracy on this simple benchmark is unchanged (74%), but the approach strengthens the reasoning signal available for downstream training. The work offers a practical path to sample-efficient learning of spatial reasoning through high-quality demonstrations and tool-based cognition.

Abstract

While Vision-Language Models (VLMs) excel in many areas, they struggle with complex spatial reasoning, which requires problem decomposition and strategic tool use. Fine-tuning smaller, more deployable models offers an efficient path to strong performance, but this is hampered by a major bottleneck: the absence of high-quality, step-by-step reasoning data. To address this data-efficiency gap, we introduce SpatialTraceGen, a framework to distill the reasoning processes of a large teacher model into a high-quality dataset of multi-hop, multi-tool reasoning traces. A key innovation is our automated Verifier, which scalably ensures the fidelity of each reasoning step, providing a cost-effective alternative to manual human annotation. On the CLEVR-Humans benchmark, this verifier-guided process improves the average quality score of traces by 17% while reducing quality variance by over 40%. SpatialTraceGen delivers a dataset of expert traces, providing the structured, step-by-step examples of tool use necessary for effective fine-tuning and sample-efficient offline reinforcement learning.
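
The abstract describes a verifier that checks each reasoning step and a resulting dataset usable for offline reinforcement learning. The following Python sketch illustrates one way such a loop could be organized; it is not the authors' implementation. Everything named here is an assumption made for illustration: the `Step` fields, the `propose` and `verify` callables (stand-ins for the teacher model and the Verifier), the numeric score scale and `threshold`, the retry policy, and the choice to use the verifier score as the per-step reward.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One tool call in a reasoning trace (hypothetical schema)."""
    tool: str
    argument: str
    observation: str
    score: float = 0.0  # filled in by the verifier


# Hypothetical interfaces: the teacher proposes the next step given the
# trace so far (None means "done"); the verifier rates a proposed step.
Proposer = Callable[[List[Step]], Optional[Step]]
Verifier = Callable[[List[Step], Step], float]


def generate_verified_trace(
    propose: Proposer,
    verify: Verifier,
    threshold: float = 6.0,   # assumed acceptance cutoff
    max_steps: int = 8,
    max_retries: int = 2,
) -> List[Step]:
    """Build a trace, keeping only steps the verifier scores >= threshold."""
    trace: List[Step] = []
    for _ in range(max_steps):
        accepted: Optional[Step] = None
        for _ in range(max_retries + 1):
            step = propose(trace)
            if step is None:          # teacher signals the trace is complete
                return trace
            step.score = verify(trace, step)
            if step.score >= threshold:
                accepted = step
                break
        if accepted is None:          # stop rather than keep a weak step
            break
        trace.append(accepted)
    return trace


def to_transitions(
    question: str, trace: List[Step]
) -> List[Tuple[str, str, float, str, bool]]:
    """Flatten a trace into (state, action, reward, next_state, done) tuples.

    Assumption: the state is the question plus the text of all prior steps,
    and the reward is the verifier score of the step taken.
    """
    transitions = []
    state = question
    for i, step in enumerate(trace):
        action = f"{step.tool}({step.argument})"
        next_state = f"{state}\n{action} -> {step.observation}"
        done = i == len(trace) - 1
        transitions.append((state, action, step.score, next_state, done))
        state = next_state
    return transitions
```

In this sketch a rejected step is simply re-sampled up to `max_retries` times; other policies (for example, feeding the verifier's critique back to the teacher) would fit the same interface.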
