SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng

TL;DR

SAIL-RL tackles two core problems in reinforcement learning post-training for multimodal LLMs: outcome-only supervision and non-adaptive thinking. It introduces Thinking Reward and Judging Reward to supervise what to think and when to think, implemented via a two-stage pipeline consisting of LongCoT SFT and RL tuning on SAIL-VL2. The approach achieves state-of-the-art results among open-source models at 8B and competitive performance with GPT-4o and Gemini-2. Substantial reductions in hallucinations and improved efficiency demonstrate a principled path toward more reliable and adaptive multimodal reasoning systems, with data pipelines and code made available for replication.

Abstract

We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
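
To make the dual-reward idea concrete, the following is a minimal sketch of how the reward signals named in the abstract might be combined into one scalar for RL tuning. It is not the authors' implementation: the abstract does not state how the signals are aggregated, so the weighted sum, the equal-weight average over the three thinking dimensions, the weight values, and all names (`RewardSignals`, `thinking_reward`, `combined_reward`) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class RewardSignals:
    """Per-response signals. Field names are hypothetical, not from the paper."""
    format_ok: bool        # response follows the expected output format
    answer_correct: bool   # final answer matches the reference
    grounding: float       # factual grounding of the reasoning, in [0, 1]
    coherence: float       # logical coherence of the reasoning, in [0, 1]
    consistency: float     # agreement between reasoning and answer, in [0, 1]
    judging_correct: bool  # think-vs-answer-directly decision suited the task


def thinking_reward(s: RewardSignals) -> float:
    # Assumption: the three quality dimensions are averaged with equal weight.
    return (s.grounding + s.coherence + s.consistency) / 3.0


def combined_reward(
    s: RewardSignals,
    w_format: float = 0.1,
    w_answer: float = 0.5,
    w_thinking: float = 0.2,
    w_judging: float = 0.2,
) -> float:
    # Assumption: a plain weighted sum; the weights are placeholders.
    return (
        w_format * float(s.format_ok)
        + w_answer * float(s.answer_correct)
        + w_thinking * thinking_reward(s)
        + w_judging * float(s.judging_correct)
    )


# A correct answer reached through weak reasoning versus one reached soundly.
lucky = RewardSignals(True, True, 0.2, 0.3, 0.1, True)
sound = RewardSignals(True, True, 1.0, 1.0, 1.0, True)
print(combined_reward(lucky), combined_reward(sound))
```

Under outcome-only supervision both responses would receive the same reward. With a thinking term, the correct answer backed by weak reasoning scores lower, which is the failure mode the Thinking Reward is meant to penalize.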

Paper Structure

This paper contains 26 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 2: Limitations of current MLLMs in reasoning. Left: Lucky success where the model reaches the correct answer through a flawed reasoning process. Right: Overthinking where the model applies a needlessly complex reasoning process to a simple problem, resulting in an incorrect answer.
  • Figure 3: An overview of SAIL-RL's multi-dimensional reward system. The system evaluates a model's response across four dimensions: Format, Answer, Thinking, and Judging. The nuanced semantic rewards for Thinking and Judging are provided by Gemini acting as a reward judge.
  • Figure 4: Evaluation results on thinking trigger.
  • Figure 5: Ablation on the training dynamics of the thinking reward. Our method (blue) consistently improves all three thinking scores over the answer-only baseline (orange), which stagnates or degrades.
  • Figure 6: Visualizing behavior on an OCR task under two different reasoning strategies. Orange: The output from a baseline that is forced to think. Blue: The output from our model guided by the proposed Judging Reward, which dynamically decides when to think.
  • ...and 1 more figure