Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Jiaqi Wang; Kevin Qinghong Lin; James Cheng; Mike Zheng Shou

TL;DR

TON addresses inefficient reasoning in vision-language models by teaching the model when to think. It uses a two-stage approach: Thought Dropout during supervised fine-tuning bootstraps the ability to skip the reasoning trace, and Group Relative Policy Optimization (GRPO) then lets the model learn, through reinforcement learning, when thinking is worth the cost. Across GSM8K, CLEVR, GeoQA, and AITZ with 3B and 7B models, TON reduces completion length by up to 90% relative to vanilla GRPO while matching or improving accuracy, which indicates adaptive, human-like selective reasoning. The work offers a practical path to more efficient multimodal reasoning in RL-trained VLMs, and the code is publicly available.

Abstract

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts, introducing a think-or-not format that serves as a cold start for selective reasoning; and (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance and in some cases improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and agentic (AITZ) tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
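
The 'thought dropout' operation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `<think>`/`<answer>` tag template, the empty-thought string, the sample keys (`question`, `thought`, `answer`), and the `drop_prob` parameter are all assumptions made for illustration, since the abstract only states that reasoning traces are randomly replaced with empty thoughts.

```python
import random

# Assumed tag format; the abstract does not specify the exact template.
EMPTY_THOUGHT = "<think>\n\n</think>"


def thought_dropout(samples, drop_prob=0.5, seed=0):
    """Randomly replace reasoning traces with an empty thought.

    samples:   list of dicts with hypothetical keys
               "question", "thought", and "answer".
    drop_prob: assumed dropout probability (not taken from the abstract).
    """
    rng = random.Random(seed)  # seeded for reproducible SFT data
    out = []
    for s in samples:
        # Decide per sample whether to drop its reasoning trace.
        dropped = rng.random() < drop_prob
        think = EMPTY_THOUGHT if dropped else f"<think>{s['thought']}</think>"
        out.append({
            "prompt": s["question"],
            "target": f"{think}<answer>{s['answer']}</answer>",
            "dropped": dropped,
        })
    return out
```

Under these assumptions, the SFT data mixes targets that contain a full reasoning trace with targets whose thought is empty, so the model sees both output formats before the GRPO stage explores which one to use for a given input.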
