Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

Jinghan Li; Junfeng Fang; Jinda Lu; Yuan Wang; Xiaoyan Guo; Tianyu Zhang; Xiang Wang; Xiangnan He

TL;DR

This work proposes difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group, yielding significant performance gains across multiple multimodal reasoning benchmarks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: std-based normalization is unstable and is easily distorted by extreme samples whose rewards are almost entirely positive or almost entirely negative. Unlike text-only LLMs, multimodal models are particularly sensitive to such distortions, because both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty level and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.
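The normalization issue described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `eps` constant, and in particular the way the shared std is pooled over a difficulty group are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Per-prompt normalization: each prompt's rollouts use their own mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shared_std_advantages(groups, eps=1e-6):
    """Assumed variant: keep each prompt's own mean, but divide by a single
    std pooled over every prompt in the same difficulty group."""
    # Center each prompt's rewards with its own mean.
    centered = [[r - mean(rs) for r in rs] for rs in groups]
    # Pool all centered rewards in the group and take their root mean square.
    pooled = [c for cs in centered for c in cs]
    sigma = (sum(c * c for c in pooled) / len(pooled)) ** 0.5
    return [[c / (sigma + eps) for c in cs] for cs in centered]


# An "easy" prompt: seven correct rollouts and one failure.
easy = [1, 1, 1, 1, 1, 1, 1, 0]
# A balanced prompt: half correct, half incorrect.
medium = [1, 0, 1, 0, 1, 0, 1, 0]

easy_adv = grpo_advantages(easy)
medium_adv = grpo_advantages(medium)
```

With per-prompt normalization, the single failure in `easy` receives an advantage of roughly -2.6, while every rollout in `medium` stays at about ±1. Dividing by a std shared across prompts of similar difficulty keeps the ordering of rollouts within a prompt but stops one near-degenerate prompt from setting its own scale.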

Paper Structure (32 sections, 24 equations, 10 figures, 7 tables)

Introduction
Preliminary
Task Formulation
Core Algorithms of Reinforcement Learning with Verifiable Reward
Durian: Difficulty-based Regrouping
Perceptual Difficulty-based Regrouping
Reasoning Difficulty-based Regrouping
Combination for Robust Optimization
Experiment
Experimental Settings
Comparison with Baseline Methods (RQ1)
Ablation Studies (RQ2)
Hyper-parameter Sensitivity Analysis (RQ3)
Groups under Perceptual difficulty-based strategy
Groups under Reasoning difficulty-based strategy
...and 17 more sections

Figures (10)

Figure 1: The advantage distribution after the normalization of reward varies among samples. Extreme samples like easy and hard ones are amplified after std-normalization, whereas medium samples exhibit more balanced advantages.
Figure 2: Overview of two difficulty-based regrouping strategies of Durian. Upper: For perceptual difficulty, we extract image patch features through the visual encoder and compute patch covariance matrices, whose eigenvalue entropy characterizes visual complexity. Bottom: For reasoning difficulty, model confidence is estimated from normalized sequence-level log probabilities across multiple rollouts. In both strategies, samples in the same group share the same std.
Figure 3: Accuracy improvements of the two re-grouping strategies over Qwen2.5-VL, with DAPO as the backbone.
Figure 4: The distribution of pre-calculated entropy on Geometry3K. The $x$ axis represents entropy and the $y$ axis the probability density; Q25 and Q75 denote the 25th and 75th percentiles, respectively.
Figure 5: Illustrative examples of different levels of entropy
...and 5 more figures
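
The reasoning-difficulty signal in the Figure 2 caption (confidence from normalized sequence-level log probabilities across rollouts) and the Q25/Q75 cut points in Figure 4 suggest a simple bucketing scheme. The sketch below is one hedged reading of those captions, not the paper's code: `sequence_confidence`, `prompt_confidence`, `bucket_by_quartiles`, the choice of length normalization, and the use of quartiles for the confidence score are all assumptions.

```python
import math
from statistics import mean, quantiles


def sequence_confidence(token_logprobs):
    """Length-normalized sequence probability: exp of the mean token log-prob."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def prompt_confidence(rollouts):
    """Average the per-rollout confidence over all rollouts of one prompt."""
    return mean(sequence_confidence(lp) for lp in rollouts)


def bucket_by_quartiles(scores):
    """Assign each score to a 'low', 'mid', or 'high' bucket using Q25 and Q75."""
    q25, _, q75 = quantiles(scores, n=4)
    return ["low" if s < q25 else "high" if s > q75 else "mid" for s in scores]


# Two hypothetical rollouts for one prompt, given as per-token log-probs.
rollouts = [[math.log(0.5)] * 2, [math.log(0.25)] * 4]
conf = prompt_confidence(rollouts)

# Hypothetical confidence scores for nine prompts.
labels = bucket_by_quartiles([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
```

Prompts that land in the same bucket would then share one std during advantage normalization. The same bucketing could be applied to a pre-computed visual entropy score for the perceptual strategy.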
