Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou

TL;DR

TON tackles inefficiency in vision-language model reasoning by teaching models when to think. It introduces a two-stage approach: Thought Dropout during supervised fine-tuning to bootstrap a skip-thought capability, followed by Group Relative Policy Optimization to learn when to activate thinking during RL. Across GSM8K, CLEVR, GeoQA, and AITZ with 3B/7B models, TON reduces completion length by up to 90% while achieving comparable or improved accuracy, demonstrating adaptive, human-like selective reasoning. This work provides a practical and scalable path to more efficient multimodal reasoning in RL-enabled VLMs, with publicly available code.

Abstract

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human thinking process, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts, introducing a think-or-not format that serves as a cold start for selective reasoning; and (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO without sacrificing performance, and in some cases while improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
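The 'thought dropout' operation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `<think>…</think>` tag format, the function name `thought_dropout`, and the parameter `drop_prob` are assumptions made for this example; the only detail taken from the source is that a dropped reasoning trace becomes an empty thought.

```python
import random
import re

# Assumption: responses wrap the reasoning trace in <think>...</think> tags.
# The tag names are hypothetical; the empty thought is written here as "\n\n".
THINK_PATTERN = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
EMPTY_THINK = "<think>\n\n</think>"


def thought_dropout(response: str, drop_prob: float, rng: random.Random) -> str:
    """With probability drop_prob, replace the reasoning trace with an empty thought."""
    if rng.random() < drop_prob:
        # Replace only the first reasoning span; the answer part is left untouched.
        return THINK_PATTERN.sub(lambda _match: EMPTY_THINK, response, count=1)
    return response


rng = random.Random(0)
sample = "<think>2 + 3 = 5</think><answer>5</answer>"
always_dropped = thought_dropout(sample, 1.0, rng)  # reasoning trace removed
never_dropped = thought_dropout(sample, 0.0, rng)   # response unchanged
```

Applied over an SFT dataset, an operation of this kind yields a mix of think and non-think examples, which is the think-or-not format the abstract describes as a cold start for the later GRPO stage.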

Paper Structure

This paper contains 34 sections, 6 equations, 30 figures, 11 tables, 1 algorithm.

Figures (30)

  • Figure 1: Illustrating the “think or not think” trade-off. Left: For simple queries, explicit reasoning is unnecessary; models trained with GRPO always "think" and so incur redundant computation. Right: For more complex geometric problems, step-by-step reasoning is essential to arrive at the correct answer. Our proposed TON framework learns to invoke reasoning adaptively, only when needed: it skips reasoning for easy cases and engages in deeper inference for harder tasks.
  • Figure 2: Accuracy comparison with vs. without “thinking” during SFT, using Qwen-2.5-VL-3B on the AITZ task.
  • Figure 3: Illustration of the responses from GRPO and TON. $q_1$ is the question and $\{o_1, \cdots, o_5\}$ are the generated responses containing thoughts $\mathcal{T}$ (circle) and answers $\mathcal{S}$ (triangle). TON can sample from the empty thought $\mathcal{T}_{\backslash n \backslash n}$, thus enhancing response diversity over vanilla GRPO.
  • Figure 4: Training metrics comparison between TON and GRPO on GeoQA. (a) Training rewards, (b) completion length over training steps, (c) ratio of non-think outputs to total samples at each step for TON, and (d) average completion length of think outputs across training.
  • Figure 5: Further analysis of TON on the AITZ benchmark. (a), (b), and (c) show the average completion length, skip-thought ratio, and reward under different dropout probabilities. (d) Prompting (hybrid) does not reduce the completion length, while TON with SFT reduces it effectively.
  • ...and 25 more figures
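
As a rough illustration of the setting in Figure 3, where a group of responses is sampled for a single question, the sketch below computes group-relative advantages in the way GRPO is commonly formulated: each response's reward is normalized by the mean and standard deviation of its group. The function name, the reward values, the `eps` constant, and the use of the population standard deviation are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    # Assumption: population standard deviation; implementations may differ.
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards for five sampled responses o_1..o_5 to one question;
# some responses may use an empty thought and others a full reasoning trace.
rewards = [1.0, 1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Under this formulation a correct response receives a positive advantage whether or not it contains a reasoning trace, so a group that includes empty-thought samples lets the policy be rewarded for skipping reasoning when the answer is still right.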