Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

Jinghan Li, Junfeng Fang, Jinda Lu, Yuan Wang, Xiaoyan Guo, Tianyu Zhang, Xiang Wang, Xiangnan He

TL;DR

This work proposes difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group, yielding significant performance gains across multiple multimodal reasoning benchmarks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: std-based normalization is unstable, because it is easily distorted by extreme samples whose rewards are almost all positive or almost all negative. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty level and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.
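To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It compares per-sample std normalization (as in GRPO) with a normalization that keeps per-sample mean-centering but divides by one std shared across a difficulty group. The function names, the `EPS` constant, and the choice to pool mean-centered rewards when computing the shared std are all assumptions made for illustration; the abstract does not specify how the shared std is computed.

```python
from collections import defaultdict
from statistics import fmean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def grpo_advantages(rewards):
    """Per-sample normalization: centre and scale one sample's rollout
    rewards by that sample's own mean and std."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def shared_std_advantages(rewards_by_sample, difficulty_by_sample):
    """Hypothetical difficulty-aware variant: centre per sample, then scale
    by a single std pooled over all samples with the same difficulty level."""
    # Mean-centre within each sample so intra-group ordering is preserved.
    centred = {
        s: [r - fmean(rs) for r in rs] for s, rs in rewards_by_sample.items()
    }
    # Pool the centred rewards by difficulty level (assumed pooling rule).
    pooled = defaultdict(list)
    for s, cs in centred.items():
        pooled[difficulty_by_sample[s]].extend(cs)
    shared = {level: pstdev(vals) for level, vals in pooled.items()}
    return {
        s: [c / (shared[difficulty_by_sample[s]] + EPS) for c in cs]
        for s, cs in centred.items()
    }
```

Under the per-sample rule, a sample whose rollouts almost all succeed has a small std, so its single failing rollout receives a large-magnitude advantage. With a shared std, every sample in the same difficulty group is scaled by the same denominator.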

Paper Structure (32 sections, 24 equations, 10 figures, 7 tables)


Figures (10)

  • Figure 1: The advantage distribution after the normalization of reward varies among samples. Extreme samples like easy and hard ones are amplified after std-normalization, whereas medium samples exhibit more balanced advantages.
  • Figure 2: Overview of two difficulty-based regrouping strategies of Durian. Upper: For perceptual difficulty, we extract image patch features through the visual encoder and compute patch covariance matrices, whose eigenvalue entropy characterizes visual complexity. Bottom: For reasoning difficulty, model confidence is estimated from normalized sequence-level log probabilities across multiple rollouts. In both strategies, samples in the same group share the same std.
  • Figure 3: Accuracy improvements of the two re-grouping strategies over Qwen2.5-VL, with DAPO as the backbone.
  • Figure 4: Distribution of pre-calculated entropy on Geometry3K. The $x$-axis represents entropy and the $y$-axis the probability density; Q25 and Q75 denote the 25th and 75th percentiles, respectively.
  • Figure 5: Illustrative examples of different levels of entropy.
  • ...and 5 more figures
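
The two difficulty signals described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's exact formulation. `spectral_entropy` takes the eigenvalues of a patch covariance matrix as given (computing them is left to a linear-algebra library) and returns the Shannon entropy of the normalized spectrum. `rollout_confidence` averages the exponentiated, length-normalized sequence log-probability over rollouts. Both function names and both formulas are hypothetical readings of the caption.

```python
import math
from statistics import fmean


def spectral_entropy(eigenvalues):
    """Shannon entropy of a covariance spectrum normalized to sum to 1.
    Negative values (numerical noise) are clipped to zero."""
    clipped = [max(ev, 0.0) for ev in eigenvalues]
    total = sum(clipped)
    if total == 0:
        return 0.0
    probs = [ev / total for ev in clipped if ev > 0]
    return -sum(p * math.log(p) for p in probs)


def rollout_confidence(token_logprobs_per_rollout):
    """Mean over rollouts of exp(average per-token log-probability),
    i.e. a length-normalized sequence likelihood (assumed definition)."""
    per_rollout = [math.exp(fmean(lps)) for lps in token_logprobs_per_rollout]
    return fmean(per_rollout)
```

A flat spectrum (variance spread over many directions) gives high entropy, while a spectrum dominated by one eigenvalue gives entropy near zero. Either score could then be thresholded, for instance at the Q25 and Q75 percentiles mentioned for Figure 4, to assign difficulty levels; that thresholding rule is an assumption here.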