VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia

TL;DR

VisionThink tackles the inefficiency of vision-language models caused by excessive visual tokens by introducing an adaptive, RL-driven pipeline that starts from downsampled images and selectively requests high-resolution inputs. It formalizes an LLM-as-Judge framework to evaluate general VQA outputs with discrete rewards and extends GRPO to a multi-turn setting, enabling the model to decide when higher detail is necessary. A carefully designed reward-penalty scheme balances accuracy, formatting, and high-resolution calls to avoid pathological behavior. Empirical results across OCR-heavy benchmarks and general VQA tasks show VisionThink achieves strong accuracy while significantly improving efficiency compared with baselines and prior efficient VLMs.

Abstract

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
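The control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker string `REQUEST_HIGH_RES`, the helper names, and the toy model are all hypothetical, and the image is a plain nested list rather than a real tensor. The abstract only states that the model emits "a special token" to request the higher-resolution image; the sketch also omits the conversation history that a real multi-turn setup would carry into the second call.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values

# Hypothetical marker; the source only says the model outputs "a special token".
REQUEST_HIGH_RES = "<request_high_res>"


def downsample(image: Image, factor: int = 2) -> Image:
    """Keep every `factor`-th pixel per axis (factor=2 keeps about 1/4 of the pixels)."""
    return [row[::factor] for row in image[::factor]]


def answer_adaptively(
    model: Callable[[Image, str], str],
    image: Image,
    question: str,
) -> Tuple[str, bool]:
    """Try the low-resolution image first; fall back to full resolution on request.

    Returns (answer, used_high_res).
    """
    first = model(downsample(image), question)
    if REQUEST_HIGH_RES not in first:
        return first, False
    # The model asked for more detail: second turn with the original image.
    return model(image, question), True


def toy_model(image: Image, question: str) -> str:
    """Stand-in for a VLM: pretends that 'read' questions need at least 4 rows."""
    needs_detail = "read" in question.lower()
    if needs_detail and len(image) < 4:
        return REQUEST_HIGH_RES
    return f"answer from {len(image)}x{len(image[0])} image"


if __name__ == "__main__":
    img = [[0] * 4 for _ in range(4)]
    print(answer_adaptively(toy_model, img, "What color is the sky?"))
    print(answer_adaptively(toy_model, img, "Read the sign."))
```

In this toy setup the general question is answered from the downsampled image in a single call, while the OCR-style question triggers a second call on the original image, so the extra visual tokens are spent only where the model asks for them.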

Paper Structure

This paper contains 40 sections, 8 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Our key observations and VisionThink performance and efficiency. Left: We find that in most general scenarios, even reducing visual tokens by a factor of four results in only a minimal performance drop. However, token compression leads to a significant performance drop on strongly OCR-related benchmarks. Right: Our VisionThink significantly outperforms previous work in both performance and efficiency.
  • Figure 2: Framework of VisionThink. (a) The left image illustrates VisionThink processing an image with resolution reduced by a factor of four, where the VLM directly provides an answer. (b) The right image shows a case where the model detects insufficient information and requests a high-resolution image to answer the question.
  • Figure 3: (a) Impact of the Penalty Ratio. Applying a penalty to all resize image requests or removing the penalty entirely will both lead to model collapse. (b) VisionThink correctly solves OCR-related problems by autonomously requesting high-resolution images.
  • Figure 4: Inference Time Cost and Benchmark Performance Comparison for Reasoning Models. Qwen-RL and Qwen-RL (1/4) denote applying LLM-as-Judge to the Qwen2.5-VL-Instruct model, with inference on the full-resolution image and the 1/4-resolution image, respectively.
  • Figure 5: VisionThink smartly determines the high-resolution image ratio. Apply Resize indicates that the model autonomously requests to view the original high-resolution image, while Direct Answer indicates that the model is able to answer the question using only the 1/4-sized image.
  • ...and 6 more figures