LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception

Yuan-Hong Liao, Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, David Acuna

TL;DR

LongPerceptualThoughts introduces a scalable three-stage data synthesis framework for distilling system-2 reasoning into instruction-tuned vision-language models (VLMs) for perception. The authors build a dataset of 30K long chain-of-thought (CoT) traces by generating verifiable multiple-choice questions (MCQs) from dense captions, extracting simple CoTs from VLMs, and expanding them with frontier reasoning models; the resulting data supports both SFT and DPO fine-tuning. Fine-tuning a 7B instruction-tuned model on this data yields an average gain of 3.4 points across five vision-centric benchmarks, including +11.8 points on V* Bench, and carries over to the text-only reasoning benchmark MMLU-Pro with a gain of about 2 points. The results suggest that structured long CoTs can improve perception as well as cross-domain reasoning, offering a practical way to use synthetic long-form reasoning in multimodal models.

Abstract

Recent reasoning models through test-time scaling have demonstrated that long chain-of-thoughts can unlock substantial performance boosts in hard reasoning tasks such as math and code. However, the benefit of such long thoughts for system-2 reasoning is relatively less explored in other domains such as perceptual tasks where shallower, system-1 reasoning seems sufficient. In this paper, we introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. The key challenges in synthesizing elaborate reasoning thoughts for perceptual tasks are that off-the-shelf models are not yet equipped with such thinking behavior and that it is not straightforward to build a reliable process verifier for perceptual tasks. Thus, we propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions from dense image descriptions, then extracts simple CoTs from VLMs for those verifiable problems, and finally expands those simple thoughts to elaborate long thoughts via frontier reasoning models. In controlled experiments with a strong instruction-tuned 7B model, we demonstrate notable improvements over existing visual reasoning data-generation methods. Our model, trained on the generated dataset, achieves an average +3.4 points improvement over 5 vision-centric benchmarks, including +11.8 points on V$^*$ Bench. Notably, despite being tuned for vision tasks, it also improves performance on the text reasoning benchmark, MMLU-Pro, by +2 points.
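
The three-stage framework described in the abstract can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `MCQ`, `Trace`, `expansion_prefix`, `final_answer`, and `build_trace`, the three model callables, and the `Answer: X` output format are all hypothetical. The `"Wait,"` cue follows the Stage 3 description in the Figure 2 caption; the final filter that keeps only traces ending in the ground-truth option is an assumption about how the verifiable questions are used.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MCQ:
    question: str
    options: List[str]
    answer: str  # ground-truth option letter, e.g. "B"


@dataclass
class Trace:
    mcq: MCQ
    simple_cot: str
    long_cot: str


# Hypothetical model interfaces (plain callables, so any backend can be plugged in).
AskLLM = Callable[[str], MCQ]          # dense caption -> multiple-choice question
ThinkVLM = Callable[[str, MCQ], str]   # (image path, MCQ) -> simple CoT
ExpandLLM = Callable[[MCQ, str], str]  # (MCQ, prefix) -> continuation of the thought


def expansion_prefix(simple_cot: str, cue: str = "Wait,") -> str:
    """Stage 3 input: the simple CoT followed by a cue that invites more thinking."""
    return f"{simple_cot.rstrip()}\n{cue}"


def final_answer(cot: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in a trace (assumed output format)."""
    letters = re.findall(r"Answer:\s*([A-Z])\b", cot)
    return letters[-1] if letters else None


def build_trace(
    image: str,
    caption: str,
    ask: AskLLM,
    think: ThinkVLM,
    expand: ExpandLLM,
) -> Optional[Trace]:
    mcq = ask(caption)                       # Stage 1: caption -> verifiable MCQ
    simple = think(image, mcq)               # Stage 2: simple CoT from a VLM
    prefix = expansion_prefix(simple)        # Stage 3: precondition the reasoning LLM
    long_cot = prefix + expand(mcq, prefix)
    # Assumed filter: drop traces whose final answer disagrees with the MCQ key.
    if final_answer(long_cot) != mcq.answer:
        return None
    return Trace(mcq, simple, long_cot)
```

Because the expanded trace starts with the simple CoT, a wrong initial answer followed by a correction after the cue is kept, which is one way a trace can end up containing backtracking.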

Paper Structure

This paper contains 28 sections, 4 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: LongPerceptualThoughts is a new synthetic dataset with 30K long-thought traces for vision-centric tasks. Each trace contains diverse cognitive behaviors (e.g., verification, subgoal setting, and backtracking), akin to system-2 reasoning. CoTs generated by open-source VLMs often produce linear, rigid reasoning traces (top). In contrast, our novel data synthesis framework effectively expands these simple thoughts using frontier reasoning models, equipping VLMs with complex reasoning structures and rich cognitive behaviors—effectively distilling system-2 reasoning into instruction-tuned VLMs.
  • Figure 2: Ask, Think, and Think Harder: The three stages to synthesize long CoT data for vision-centric tasks. Assuming access to an image and its associated dense caption, we first ask an LLM to convert the dense caption into multiple-choice questions. In Stage 2, we extract simple CoTs from a VLM. These simple CoTs typically exhibit shallow and rigid reasoning, especially in vision-centric tasks. Therefore, in Stage 3, we precondition a reasoning LLM with these simple CoTs and append a subtle cue, e.g., "Wait,", to elicit more diverse long CoTs.
  • Figure 3: (a) Analysis of Cognitive Behaviors in Chain-of-Thought (CoT). CoTs from open-source VLMs often follow rigid structures. In contrast, frontier reasoning VLMs, such as Gemini 2.0 Flash Thinking, exhibit more diverse cognitive behaviors, including subgoal setting, backtracking, and verification. Our long CoT dataset, LongPerceptualThoughts, also demonstrates a wide range of such behaviors. (b) Length of CoTs. The CoTs in LongPerceptualThoughts are significantly longer than those generated by popular VLMs, e.g., Qwen2.5-VL. (c) Response length vs. aggregated performance. Fine-tuning a VLM on LongPerceptualThoughts, with its complex reasoning structures, leads to higher overall performance with slightly more output tokens. On the other hand, fine-tuning on other multimodal reasoning data leads to over-thinking and worse performance. Cognitive behaviors are quantified following Gandhi et al. (2025).
  • Figure 4: Response lengths vs. question difficulties. We analyze the responses of the VLM fine-tuned on LongPerceptualThoughts via DPO. Interestingly, we find that the model fine-tuned on our data naturally allocates more test-time compute to hard questions. We follow Lightman et al. (2024) and Snell et al. (2025) and determine question complexity using rollouts on the base model.
  • Figure 5: Text prompt for converting descriptions into multiple-choice questions.
  • ...and 5 more figures
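
The Figure 4 caption says question complexity is determined from rollouts on the base model. One common way to turn rollouts into a difficulty score is the fraction of sampled answers that are wrong; the sketch below assumes that definition, which the captions here do not spell out. The function names `rollout_difficulty` and `difficulty_bin` and the default of five bins are hypothetical.

```python
from typing import Sequence


def rollout_difficulty(predictions: Sequence[str], answer: str) -> float:
    """Fraction of base-model rollouts that miss the ground-truth answer (0 = easy, 1 = hard)."""
    if not predictions:
        raise ValueError("need at least one rollout")
    wrong = sum(p != answer for p in predictions)
    return wrong / len(predictions)


def difficulty_bin(difficulty: float, num_bins: int = 5) -> int:
    """Map a difficulty in [0, 1] to a bin index in [0, num_bins - 1]."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    # A difficulty of exactly 1.0 is clamped into the last bin.
    return min(int(difficulty * num_bins), num_bins - 1)
```

Grouping questions by `difficulty_bin` and averaging the fine-tuned model's response length within each group would give a plot of the kind Figure 4 describes.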