Learning to Think Fast and Slow for Visual Language Models

Chenyu Lin; Cheng Chi; Jinlin Wu; Sharon Li; Kaiyang Zhou

TL;DR

DualMindVLM is a dual-mode visual language model that automatically switches between fast (System 1) and slow (System 2) thinking, answering simple visual reasoning tasks briefly and reserving longer reasoning for complex ones. The method combines a data-driven stage that auto-labels the thinking mode of each training sample with a reinforcement-learning stage (GRPO) for dual-mode training, which uses prefix-guided and free-form rollouts to learn when to apply fast or slow thinking. Across six multimodal benchmarks, DualMindVLM achieves accuracy competitive with state-of-the-art reasoning models while using substantially fewer tokens, and it exhibits reduced hallucinations. The work highlights the value of adaptive, multi-speed reasoning in multimodal systems and points to data-centric RL and bias mitigation as avenues for future improvement.

Abstract

When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
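The first stage described above, labeling each sample by the length of the model's output, can be illustrated with a minimal sketch. The abstract does not state how the length is measured or where the cutoff lies, so the following are assumptions made only for illustration: lengths are token counts averaged over several sampled answers from the pre-trained VLM, a single fixed cutoff separates the two modes, and the names `label_thinking_mode`, `label_dataset`, and `threshold_tokens` (with its default of 100) are hypothetical.

```python
from statistics import mean

FAST, SLOW = "fast", "slow"


def label_thinking_mode(response_lengths, threshold_tokens=100):
    """Label one question from the token lengths of sampled answers.

    Assumption: the label is decided by the mean length of several
    sampled responses. threshold_tokens is a hypothetical cutoff,
    not a value taken from the paper.
    """
    if not response_lengths:
        raise ValueError("need at least one sampled response length")
    # Long average outputs suggest the question needs slow thinking.
    return SLOW if mean(response_lengths) > threshold_tokens else FAST


def label_dataset(lengths_by_question, threshold_tokens=100):
    """Map each question id to a thinking-mode label."""
    return {
        qid: label_thinking_mode(lengths, threshold_tokens)
        for qid, lengths in lengths_by_question.items()
    }


if __name__ == "__main__":
    # Made-up token counts for two illustrative questions.
    sampled = {
        "count_objects": [12, 18, 15],
        "geometry_problem": [240, 310, 275],
    }
    print(label_dataset(sampled))
```

In the second stage, labels of this kind would accompany each training sample during GRPO training. How the labels enter the rollouts and the reward is not specified in the abstract, so it is left out of the sketch.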
