Learning to Think Fast and Slow for Visual Language Models

Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou

TL;DR

DualMindVLM is a dual-mode visual language model that automatically switches between fast (System 1) and slow (System 2) thinking, so that simple visual questions get short answers and complex ones get detailed reasoning. The method has two parts: a data-driven stage that auto-labels each training sample with a thinking mode, and reinforcement-learning-based dual-mode training with GRPO, which mixes prefix-guided and free-form rollouts so the model learns when to apply each mode. Across six multimodal benchmarks, DualMindVLM achieves competitive accuracy while using substantially fewer tokens than state-of-the-art reasoning models, and it hallucinates less. The work highlights the value of adaptive, multi-speed reasoning in multimodal systems and points to data-centric RL and bias mitigation as avenues for future improvement.

Abstract

When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
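The first stage described above (labeling each sample as fast or slow from the base model's output length) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `label_thinking_mode`, the use of the mean over several sampled responses, and the 100-token cutoff are all assumptions made for the example, since the abstract does not give these details.

```python
from statistics import mean

# Hypothetical cutoff in tokens; the abstract does not state a value.
LENGTH_THRESHOLD = 100


def label_thinking_mode(response_lengths, threshold=LENGTH_THRESHOLD):
    """Label a sample 'fast' or 'slow' from the base model's response lengths.

    response_lengths: token counts of responses sampled from the base model
    for one question (averaging several samples is an assumption here).
    """
    if not response_lengths:
        raise ValueError("need at least one sampled response")
    return "slow" if mean(response_lengths) > threshold else "fast"


# Toy data: token counts of three sampled responses per question.
samples = {
    "What color is the car?": [12, 9, 15],
    "Find the missing angle in the diagram.": [240, 310, 198],
}
labels = {question: label_thinking_mode(lengths)
          for question, lengths in samples.items()}
```

With these toy lengths, the short-answer question is labeled as fast and the geometry question as slow. In practice the cutoff would need to be chosen from the observed length distribution of the base model.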

Paper Structure

This paper contains 36 sections, 4 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Comparison among the base model, the GRPO model and our DualMindVLM. For simple queries, the GRPO model tends to produce unnecessarily long responses, leading to additional computational overhead for questions that the base model can already handle efficiently. In contrast, our model adaptively balances response length by maintaining concise answers for simple queries and engaging in detailed reasoning for complex ones through two automatically selected modes of thinking.
  • Figure 2: Accuracy vs. token budgets. Under the same token budget, DualMindVLM performs favorably against other models.
  • Figure 3: Overview of DualMindVLM. (a) For each VQA pair, we annotate its thinking mode based on the base model's response length and discard samples for which all responses are correct or all are incorrect (to avoid zero relative advantage in GRPO training). (b) During GRPO, the thinking mode label is used to guide the generation of one group of candidate responses, while a second group is generated using the model's own judgment. A group-wise advantage is computed over all candidate responses to update the model.
  • Figure 4: Average response lengths of a pre-trained general-purpose VLM across a variety of VQA tasks. The simpler the question, the shorter the response; the harder the question, the longer the response. Response length is therefore indicative of task difficulty.
  • Figure 5: System prompt for dual-mode RL training.
  • ...and 14 more figures
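
The group-wise advantage mentioned in the Figure 3 caption, and the reason for discarding samples whose responses are all correct or all incorrect, can be illustrated with a short sketch. It assumes binary correctness rewards and the usual GRPO normalization (reward minus group mean, divided by group standard deviation); the function names, the `eps` term, and the choice of population standard deviation are assumptions for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    eps (an assumed stabilizer) avoids division by zero when all rewards
    in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_informative(rewards):
    """A sample carries learning signal only if its rewards are not all equal."""
    return len(set(rewards)) > 1


# Toy rewards (1.0 = correct, 0.0 = incorrect) for one question.
prefix_guided = [1.0, 1.0, 0.0, 1.0]  # rollouts started from the labeled mode
free_form = [0.0, 1.0, 0.0, 0.0]      # rollouts where the model picks the mode

# The advantage is computed over both groups together.
advantages = group_relative_advantages(prefix_guided + free_form)

# If every response is correct (or every one is wrong), all advantages are
# zero, so such samples contribute no gradient and can be filtered out.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Here correct responses receive positive advantages and incorrect ones negative, while the all-correct group yields only zeros, which matches the caption's rationale for filtering those samples before training.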