MAPLE: Modality-Aware Post-training and Learning Ecosystem

Nikhil Verma, Minjung Kim, JooYoung Yoo, Kyung-Min Jin, Manasa Bharadwaj, Kevin Ferreira, Ko Keun Kim, Youngjoon Kim

TL;DR

MAPLE addresses modality-blind RL post-training in multimodal language models with a modality-aware ecosystem of three parts: MAPLE-bench, MAPO, and adaptive training strategies. MAPLE-bench is a benchmark whose Required Modality Tags mark which signals each task needs, enabling tag-based sampling, evaluation, and analysis across uni-, bi-, and tri-modal tasks for QA and captioning. MAPO stabilizes training by stratifying data by modality requirement; adaptive KL-based weighting and curriculum scheduling then balance learning between difficult and easy modality regimes. Ablations show the value of sample-level loss, asymmetric clipping, early filtering, and curriculum. Across QA and captioning, MAPLE narrows uni/multi-modal gaps and converges faster, with further fusion gains under the adaptive strategies. Robustness tests such as MAPLE-QA+ and CRW illustrate the practical deployment benefits under partial-signal conditions.

Abstract

Multimodal language models now integrate text, audio, and video for unified reasoning. Yet existing RL post-training pipelines treat all input signals as equally relevant, ignoring which modalities each task actually requires. This modality-blind training inflates policy-gradient variance, slows convergence, and degrades robustness to real-world distribution shifts where signals may be missing, added, or reweighted. We introduce MAPLE, a complete modality-aware post-training and learning ecosystem comprising: (1) MAPLE-bench, the first benchmark explicitly annotating the minimal signal combinations required per task; (2) MAPO, a modality-aware policy optimization framework that stratifies batches by modality requirement to reduce gradient variance from heterogeneous group advantages; and (3) adaptive weighting and curriculum scheduling that balance and prioritize harder signal combinations. Systematic analysis across loss aggregation, clipping, sampling, and curriculum design establishes MAPO's optimal training strategy. Adaptive weighting and curriculum-focused learning further boost performance across signal combinations. MAPLE narrows uni/multi-modal accuracy gaps by 30.24%, converges 3.18× faster, and maintains stability across all modality combinations under realistic reduced signal access. MAPLE constitutes a complete recipe for deployment-ready multimodal RL post-training.
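
The batch stratification described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sample schema (a dict with a `tag` field), the helper names `stratified_batches` and `group_advantages`, and the mean/standard-deviation normalization of rewards within a group are all assumptions made for illustration. The sketch only shows the general idea that every batch draws from a single required-modality tag, so that advantages are never normalized across tags with different reward scales.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def stratified_batches(samples, batch_size, seed=0):
    """Return (tag, batch) pairs where each batch holds samples of one tag only.

    `samples` is a list of dicts with a hypothetical "tag" key such as
    "V", "AS", or "VAS" (the required-modality tag of the task).
    """
    rng = random.Random(seed)
    by_tag = defaultdict(list)
    for sample in samples:
        by_tag[sample["tag"]].append(sample)

    batches = []
    for tag, group in by_tag.items():
        rng.shuffle(group)  # shuffle within a tag
        for start in range(0, len(group), batch_size):
            batches.append((tag, group[start:start + batch_size]))
    rng.shuffle(batches)  # interleave tags across training steps
    return batches


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (assumed form: (r - mean) / (std + eps))."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    data = [{"id": i, "tag": t} for i, t in enumerate(["V", "A", "VAS", "V", "A", "V"])]
    for tag, batch in stratified_batches(data, batch_size=2):
        print(tag, [s["id"] for s in batch])
```

Whether to shuffle batches across tags or to order them by a curriculum is a design choice; the shuffle here is only a placeholder for that decision.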

Paper Structure

This paper contains 58 sections, 28 equations, 16 figures, 8 tables, 2 algorithms.

Figures (16)

  • Figure 1: Heterogeneous multimodal latent geometry needs stratified RL. Unimodal clusters (V: video, A: audio, S: subtitle) connect through bi-modal (VA, VS, AS) and tri-modal (VAS) manifolds. Modality-blind RL mixes trajectories from these 7 distinct regions indiscriminately, inflating gradient variance across heterogeneous reward scales. MAPLE's stratified training assigns each trajectory to its native signal combination for stable optimization.
  • Figure 2: Overview of the MAPLE-bench curation pipeline. Seed videos are sampled from large-scale corpora and decomposed into audio, visual, and textual streams. Each modality undergoes independent annotation and cross-modal consistency alignment before generating captions and query–response pairs. A modality-isolated prompting stage then synthesizes tasks across uni-, bi-, and tri-modal configurations, producing a balanced dataset for modality-aware post-training.
  • Figure 3: MAPLE algorithmic variations across four design axes: (a) Sample-level loss aggregation outperforms the alternatives, (b) Asymmetric clipping maintains low clip fractions, (c) Dynamic filtering accelerates reward gains, and (d) Curriculum improves accuracy by modality complexity.
  • Figure 4: Tag-wise disparity heatmaps across MAPLE-QA-eval. Left: Basic MAPO reduces cross-tag variance. Right: Full adaptive MAPO ($adp_w + adp_{\text{cur}}$) achieves minimal disparity (smaller values better).
  • Figure 5: Tag-wise disparity heatmaps across MAPLE-Caption-eval. Left: Basic MAPO reduces cross-tag variance. Right: Full adaptive MAPO ($adp_w + adp_{\text{cur}}$) achieves minimal disparity (smaller values better).
  • ...and 11 more figures
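
The seven signal combinations named in the Figure 1 caption are the non-empty subsets of the three modalities V, A, and S. A short sketch, assuming only those single-letter tags (the function name `signal_combinations` is hypothetical), enumerates them in the same order the caption lists them:

```python
from itertools import combinations

# Single-letter modality tags as used in the Figure 1 caption.
MODALITIES = ("V", "A", "S")  # video, audio, subtitle


def signal_combinations(modalities=MODALITIES):
    """List every non-empty subset of the modalities, smallest subsets first."""
    return [
        "".join(combo)
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


if __name__ == "__main__":
    print(signal_combinations())
```

With three modalities this gives 2^3 - 1 = 7 tags: three unimodal, three bi-modal, and one tri-modal.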