From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen, Wenao Ma, Zhijian Hou, Mengyang Wu, Xiaolei Li, Senkang Hu, Ziyi Guan, Jason Chun Lok Li, Lai Man Po

TL;DR

The paper tackles RLVR for multimodal LLMs under annotation noise by proposing a two-stage token-level entropy-guided GRPO that transitions from entropy maximization to entropy minimization to realize exploration-to-exploitation. It formalizes token-level entropy and a scheduling strategy, integrates it with GRPO, and demonstrates robustness across multiple backbones and tasks, including GUI grounding, fine-grained classification, and open-vocabulary detection with noisy labels. Key findings show substantial gains over baselines, including improved accuracy under high noise and better out-of-distribution generalization, with switching points around 80% of training. The work highlights the practical potential of entropy-aware policy optimization for learning from imperfect, real-world data in multimodal reasoning systems.
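As a rough illustration of the exploration-to-exploitation schedule summarized above, the sketch below computes mean token-level entropy from per-token logits and flips the sign of an entropy term at a switch point. The function names, the coefficient `beta`, and the hard switch at 80% of training are assumptions made for illustration; the paper's exact objective is not reproduced here.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(seq_logits):
    """Average entropy over all token positions of one generated response."""
    return sum(token_entropy(l) for l in seq_logits) / len(seq_logits)

def entropy_sign(step, total_steps, switch_frac=0.8):
    """+1.0 in the exploration stage (maximize entropy), -1.0 afterwards.

    switch_frac=0.8 is an assumed default, loosely based on the summary's
    remark that switching points fall around 80% of training.
    """
    return 1.0 if step < switch_frac * total_steps else -1.0

def entropy_loss_term(seq_logits, step, total_steps, beta=0.01, switch_frac=0.8):
    """Hypothetical term added to a loss that is being minimized.

    Stage 1: returns -beta * H, so minimizing the loss raises entropy.
    Stage 2: returns +beta * H, so minimizing the loss lowers entropy.
    """
    h = mean_token_entropy(seq_logits)
    return -entropy_sign(step, total_steps, switch_frac) * beta * h
```

In a training loop, a term like this would be added to the policy loss at each step; how it is weighted against the GRPO objective is a design choice this sketch does not settle.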

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones - Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B - spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
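The abstract's point about intra-group variation can be seen in the group-relative advantage commonly used with GRPO, where each sampled response's reward is normalized by the mean and standard deviation of its group. The sketch below is a minimal pure-Python illustration; the function name, the `eps` stabilizer, the use of the population standard deviation, and the reward values are all assumptions rather than details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Assumes population standard deviation; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group whose responses all earn the same reward yields zero advantages,
# so it provides no ranking signal for the policy update.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A group with varied rewards yields positive and negative advantages.
diverse = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under this normalization, a policy that has collapsed to near-identical outputs tends to produce groups like `collapsed`, which is one way to read the abstract's argument for keeping outputs diverse early in training.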

Paper Structure

This paper contains 16 sections, 8 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: ScreenSpot accuracy after 1000 steps of different training strategies on the Qwen2.5-VL-3B model. The horizontal axis shows the training data configurations. The proposed two-stage entropy-guided RLVR training method (GRPO w. Two.) outperforms one-shot RL [gao2025oneshotentropyminimization], RLVR with “spurious rewards” (including format reward and random reward) [shao2025spuriousrewardsrethinkingtraining], and RLVR with pure entropy minimization or maximization [zhang2025right], even on fully mislabeled training data. For instance, when trained on 500 wrongly labeled samples, the model attains a 5.2% gain over its accuracy before RL. The proposed method obtains consistent improvements across different annotation rates (0%, 20%, 50%, 100%).
  • Figure 2: Qualitative effect of entropy scheduling on the GUI grounding task. We visualise the reasoning trace (〈think〉…〈/think〉) and predicted coordinate produced by: GRPO, GRPO with entropy minimization, GRPO with entropy maximization, and GRPO with two-stage entropy optimization. The ground-truth bounding box is outlined in red on the image.
  • Figure 3: (a) Comparison of token-level entropy dynamics during training with 100% noise; (b) comparison of test score at each training step during training with 100% noise. We compare four strategies: standard GRPO, GRPO with entropy maximization, GRPO with entropy minimization, and GRPO with two-stage entropy optimization.