Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View

Jianyu Qi, Ding Zou, Wenrui Yan, Rui Ma, Jiaxu Li, Zhijie Zheng, Zhiguo Yang, Rongchang Zhao

TL;DR

This work tackles the lack of quantifiable sample hardness in multimodal post-training by introducing two difficulty-aware metrics: Progressive Image Semantic Masking (PISM) for visual sensitivity and Cross-Modality Attention Balance (CMAB) for cross-modal interaction. A hierarchical training framework is then explored that compares GRPO-only against SFT+GRPO across six benchmarks, with the two metrics guiding the selection of mid-difficulty and hard samples. Across perception and reasoning tasks, difficulty-stratified GRPO-only training consistently outperforms SFT+GRPO, reducing reliance on supervised templates and mitigating pseudo-CoT patterns. The findings imply that intelligent data selection can surpass traditional multi-stage pipelines, offering a simpler, more robust route to multimodal alignment and reasoning, with practical impact for deploying capable MLLMs without heavy supervised fine-tuning. The thresholds $\tau=0.1$, $\lambda_{hard}=0.4$, and $\lambda_{easy}=0.7$, together with the $\rho_t$ balance thresholds, are used to categorize samples and guide learning.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of DeepSeek-R1, researchers have extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) the lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization, and (2) suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address these gaps, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate consistent superiority of GRPO applied to difficulty-stratified samples compared to conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.

Paper Structure

This paper contains 19 sections, 4 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Illustration of the PISM (Progressive Image Semantic Masking) method. We progressively mask different portions of the image, from no masking ($mask\_ratio=0.0$) to heavy masking ($mask\_ratio > 0.7$). Each masked image is created by randomly hiding a certain percentage of pixels. The process simulates varying levels of visual information loss. The model's performance is then evaluated on these masked images to understand how much it relies on visual details for accurate reasoning.
  • Figure 2: Illustration of the CMAB (Cross-Modality Attention Balance) method. For each generated token, we calculate its average attention score over the input text tokens and image tokens across all transformer layers, and then average these scores across all generated tokens. $N$ represents the total number of layers of the transformer.
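The two captions above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the zero-fill used to hide pixels, the `is_correct` callback standing in for a model evaluation, and the assumption that attention weights arrive as a head-averaged array of shape (layers, generated tokens, input tokens) are all hypothetical choices made for illustration. How the paper combines the two modality averages into a balance score, and how its thresholds are applied, is not stated in this section, so the sketch stops at the raw quantities.

```python
import numpy as np


def mask_pixels(image, mask_ratio, rng):
    """Return a copy of `image` with a random fraction of pixels hidden.

    `image` has shape (H, W) or (H, W, C). Hidden pixels are set to 0,
    which is an assumption; the caption only says pixels are hidden.
    """
    h, w = image.shape[:2]
    n_hide = int(round(mask_ratio * h * w))
    hide = np.zeros(h * w, dtype=bool)
    hide[rng.choice(h * w, size=n_hide, replace=False)] = True
    masked = image.copy()
    masked[hide.reshape(h, w)] = 0  # boolean mask over the two spatial axes
    return masked


def first_failure_ratio(image, is_correct, ratios, rng):
    """Smallest mask ratio in `ratios` at which `is_correct` returns False.

    `is_correct` is a hypothetical callback wrapping the model's answer check.
    Returns None if the model never fails.
    """
    for ratio in ratios:
        if not is_correct(mask_pixels(image, ratio, rng)):
            return ratio
    return None


def modality_attention(attn, text_idx, image_idx):
    """Average attention paid to text tokens and to image tokens.

    `attn` has shape (layers, generated_tokens, input_tokens). For each
    generated token, scores are averaged over the tokens of one modality
    and over all layers, then averaged over generated tokens.
    """
    attn = np.asarray(attn)
    text_per_token = attn[:, :, text_idx].mean(axis=(0, 2))
    image_per_token = attn[:, :, image_idx].mean(axis=(0, 2))
    return float(text_per_token.mean()), float(image_per_token.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.ones((10, 10, 3))
    ratios = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    # Toy stand-in for a model: "correct" while at least 55% of pixels survive.
    toy_model = lambda img: (img.sum(axis=2) > 0).mean() >= 0.55
    print(first_failure_ratio(image, toy_model, ratios, rng))
```

A sample whose first failure occurs at a low mask ratio depends heavily on visual detail, while one that survives heavy masking can largely be answered from text; the two attention averages give a complementary view of which modality the model attends to while generating.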