Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale

David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi

TL;DR

This work presents Long Grounded Thoughts, a scalable two-stage data synthesis framework for vision-centric reasoning that generates over 1M grounded, compositional QA pairs by grounding questions to specific image regions and then composing them into harder problems. By distilling simple CoTs from VLMs and expanding them with reasoning LLMs, the approach yields rich, non-linear reasoning traces while staying in-distribution for the target model. Finetuning Qwen2.5-VL-7B on this data achieves state-of-the-art open-data performance on vision benchmarks and transfers to text-only and audio reasoning, with offline RL (DPO) nearly matching online RL (GRPO) in effectiveness while requiring less compute. The findings also give insight into post-training dynamics: skill teaching via SFT is needed before online RL, staged offline training scales well, and high-quality reasoning data enables cross-modality transfer.

Abstract

Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprisingly, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU). Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high-quality data can substantially improve out-of-domain, cross-modality transfer.

Paper Structure

This paper contains 22 sections, 6 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Overview of our two-stage synthesis framework. First, we synthesize multiple-choice questions (MCQs) from dense captions and grounded object metadata, emphasizing scale and diversity while teaching basic cognitive behaviors (verification, backtracking, correction). Later, we harden questions by composing them into visual reasoning problems that require decomposition and higher-order reasoning. For each stage, we also synthesize reasoning traces by first distilling CoTs from VLMs and then expanding them with reasoning LLMs, yielding traces that are in the distribution of VLM outputs yet richer in reasoning depth.
  • Figure 2: Scaling behaviour of LPT vs. ours for SFT. We find that using additional metadata (here, bounding boxes) alongside highly detailed captions allows more diverse and controlled generation of MCQs, successfully scaling beyond 1M examples.
  • Figure 3: Reasoning trace comparison between our model (post-SFT and RL) and the vanilla base model. Both models initially fail to identify the dog in the image. The base model terminates with an incorrect answer based on this flawed premise. In contrast, our model demonstrates a non-linear reasoning process; it employs self-verification and backtracking to challenge and self-correct its initial assessment. This correction appears to stem from a trace where the model relies on captioning and grounding as a bridge between language and vision; notably grounding on the dog triggers the revised path on a second "self-captioned" verification structure. This behavior is notable as captions were not explicitly included in the training traces, perhaps suggesting captioning and grounding as part of the thinking process could be an emergent capability of training on our data.
  • Figure 4: Analysis of our data splits. (a) Complexity estimation via multiple rollouts on synthesized MCQs using Qwen2.5-VL as a policy. Darker green represents easier problems. (b) Analysis of cognitive behaviors in CoTs. Our data exhibits higher frequencies of subgoal setting, backtracking, and verification, indicating a more deliberate and structured reasoning process. The estimation of cognitive behaviors and the terminology are borrowed from Gandhi et al. (2025).
  • Figure 5: Quantitative and qualitative comparison of the post-training pipeline on our data vs. pure RL on the base model. (Right) The graph illustrates the effect of scaling dataset size during online RL. The baseline (blue line), starting from an off-the-shelf model, exhibits negative scaling: performance peaks at 0.695 (10K samples) and degrades with more data. In contrast, our method (green line), which includes SFT on our high-quality data with complex reasoning traces, allows online RL to scale further. This suggests that without offline "skill teaching" via SFT, online RL fails to effectively utilize larger datasets. (Left) A qualitative example (from V* Bench), using each model's best checkpoint (indicated by a dot on the curve), highlights the resulting difference in reasoning. The baseline model fails to identify the partially obscured dog and answers incorrectly. Our model also initially expresses confusion but then self-corrects ("Wait, I'm getting conflicting information..."), showcasing a multi-step reasoning process to arrive at the correct answer. This self-correction capability, instilled with our data, is not observed in the baseline, indicating that RL alone was insufficient to elicit this behavior. Image brightness was increased for illustration purposes.
  • ...and 3 more figures
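
The rollout-based complexity estimation mentioned in the Figure 4(a) caption can be illustrated with a minimal sketch. It assumes that difficulty is scored as the fraction of sampled policy answers that match the reference answer. The function names, the bucket labels, and the 0.75 / 0.25 thresholds are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import Counter


def pass_rate(rollout_answers, correct_answer):
    """Fraction of sampled answers that match the reference answer."""
    if not rollout_answers:
        raise ValueError("need at least one rollout")
    hits = sum(1 for answer in rollout_answers if answer == correct_answer)
    return hits / len(rollout_answers)


def difficulty_bucket(rate, easy_threshold=0.75, hard_threshold=0.25):
    """Map a pass rate to a coarse label (thresholds are hypothetical)."""
    if rate >= easy_threshold:
        return "easy"
    if rate <= hard_threshold:
        return "hard"
    return "medium"


def summarize(questions):
    """Count questions per bucket.

    `questions` is an iterable of (rollout_answers, correct_answer) pairs,
    where rollout_answers are the MCQ options chosen in each policy sample.
    """
    return Counter(
        difficulty_bucket(pass_rate(answers, correct))
        for answers, correct in questions
    )


# Toy data: three MCQs with four sampled answers each.
demo = [
    (["A", "A", "A", "A"], "A"),  # every rollout correct
    (["B", "A", "C", "A"], "A"),  # half of the rollouts correct
    (["C", "C", "B", "D"], "A"),  # no rollout correct
]

if __name__ == "__main__":
    print(summarize(demo))
```

In practice the rollout answers would come from sampling the policy model several times per question; here they are hard-coded so the sketch stays self-contained.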