MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, Faqiang Qian, Yichao Wu

TL;DR

The paper addresses a limitation of caption-centric pre-training: it constrains visual grounding in multimodal models. It introduces MMRPT, a framework that uses attention to identify vision-dependent language units, masks them to construct vision-sensitive training data, and trains with reinforcement learning to reward visually grounded inference. The model produces a structured think/answer output, and the reward combines Exact Match and strict prefix criteria. Experiments show consistent zero-shot improvements across diverse benchmarks and improved robustness under supervised fine-tuning, though some structured tasks such as ChartQA benefit less. Overall, MMRPT provides a scalable, reasoning-centered pre-training objective that reduces reliance on captions and promotes deeper visual understanding in multimodal models.

Abstract

Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
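
The data-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a token's visual dependency is the share of its attention mass that falls on visual tokens, that a sentence's score is the mean over its tokens, and that a fixed fraction of the highest-scoring sentences is masked. The function names, the `ratio` parameter, and the `<mask>` placeholder are hypothetical.

```python
def sentence_visual_dependency(attn, visual_idx, sentence_spans):
    """Score each sentence by how much its tokens attend to visual tokens.

    attn[t][k]     : attention weight from text token t to context position k
    visual_idx     : context positions that hold visual tokens
    sentence_spans : (start, end) token ranges per sentence, end exclusive

    Assumption: dependency = mean attention mass on visual positions.
    """
    visual = set(visual_idx)
    scores = []
    for start, end in sentence_spans:
        token_scores = [
            sum(w for k, w in enumerate(attn[t]) if k in visual)
            for t in range(start, end)
        ]
        scores.append(sum(token_scores) / len(token_scores))
    return scores


def mask_top_sentences(sentences, scores, ratio=0.3, mask_token="<mask>"):
    """Replace the most vision-dependent sentences with a mask token.

    Returns the masked sentence list and the masked-out targets (in
    document order) that the model would be asked to reconstruct.
    `ratio` is a hypothetical knob; at least one sentence is masked.
    """
    n_mask = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_mask])
    masked = [mask_token if i in top else s for i, s in enumerate(sentences)]
    targets = [sentences[i] for i in sorted(top)]
    return masked, targets
```

In this sketch, a sentence whose tokens attend mostly to image positions is masked first, while a sentence that could be predicted from text alone is left visible as context.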

Paper Structure

This paper contains 10 sections, 13 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Layer-wise within- and between-sentence variance of token-level visual dependency in Qwen2.5-VL-7B. Both metrics peak in the upper-middle decoder layers (29–31), indicating that these layers best differentiate visually grounded content from language-only content. We therefore adopt this layer region for dependency estimation in MMRPT.
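
The two quantities named in the Figure 1 caption can be sketched as a standard variance decomposition. This is an illustration under assumptions rather than the paper's exact definition: token-level dependency scores for one layer are grouped by sentence, within-sentence variance is the spread of tokens around their sentence mean, and between-sentence variance is the spread of sentence means around the grand mean. Choosing the layer with the largest between-sentence variance is likewise a hypothetical selection rule, and both function names are invented for the example.

```python
def within_between_variance(dep, sentence_spans):
    """Split the variance of token-level dependency scores by sentence.

    dep            : per-token visual-dependency scores for one layer
    sentence_spans : (start, end) token ranges per sentence, end exclusive

    Returns (within, between); their sum is the population variance of
    all tokens covered by the spans.
    """
    groups = [dep[s:e] for s, e in sentence_spans]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / n_total
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    ) / n_total
    return within, between


def best_layer_by_between_variance(per_layer_dep, sentence_spans):
    """Return the index of the layer with the largest between-sentence
    variance (an assumed selection rule, not taken from the paper)."""
    between = [
        within_between_variance(dep, sentence_spans)[1]
        for dep in per_layer_dep
    ]
    return max(range(len(between)), key=lambda i: between[i])
```

A layer that assigns nearly the same score to every sentence yields a between-sentence variance close to zero, so it carries little information about which sentences depend on the image.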