ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL

Xingyu Lu; Jinpeng Wang; YiFan Zhang; Shijie Ma; Xiao Hu; Tianke Zhang; Haonan Fan; Kaiyu Jiang; Changyi Liu; Kaiyu Tang; Bin Wen; Fan Yang; Tingting Gao; Han Li; Chun Yuan

TL;DR

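This work proposes ContextRL, a framework that uses context augmentation to overcome two bottlenecks in RLVR: Identifiability and Reachability. To enhance Identifiability, it gives the reward model full reference solutions as context; to improve Reachability, it uses multi-turn sampling in which the reward model's mistake reports guide the policy after failed attempts.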

Abstract

We propose ContextRL, a novel framework that leverages context augmentation to overcome two information bottlenecks in RLVR: the Identifiability of correctness and the Reachability of positive solutions. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification that filters out false positives (samples with the right answer but a low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy in which the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
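The two mechanisms described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Judgement` type, the `Policy` and `RewardModel` callable signatures, the `sample_group` function, and the `group_size` and `max_turns` parameters are all hypothetical names introduced here for illustration. The sketch also assumes binary rewards and that retried responses replace the failed ones in the group, neither of which is stated in the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Judgement:
    """Hypothetical reward-model output for one response."""
    correct: bool             # verdict on both the answer and the reasoning process
    mistake_report: str = ""  # feedback describing what went wrong, if anything


# Hypothetical interfaces (assumptions, not the paper's API):
# policy(question, mistake_report_or_None) -> response
Policy = Callable[[str, Optional[str]], str]
# reward_model(question, response, reference_solution) -> Judgement
RewardModel = Callable[[str, str, str], Judgement]


def sample_group(
    question: str,
    reference_solution: str,
    policy: Policy,
    reward_model: RewardModel,
    group_size: int = 4,
    max_turns: int = 2,
) -> Tuple[List[str], List[float]]:
    """Sample one group of responses, retrying all-negative groups.

    The reward model always sees the full reference solution as context,
    so it can reject responses whose reasoning is wrong even when the
    final answer happens to match.
    """
    responses = [policy(question, None) for _ in range(group_size)]
    judgements = [reward_model(question, r, reference_solution) for r in responses]

    for _ in range(max_turns - 1):
        if any(j.correct for j in judgements):
            break  # the group already contains a positive sample
        # All-negative group: feed each mistake report back to the policy.
        # Assumption: the retried responses replace the failed ones.
        responses = [policy(question, j.mistake_report) for j in judgements]
        judgements = [reward_model(question, r, reference_solution) for r in responses]

    rewards = [1.0 if j.correct else 0.0 for j in judgements]
    return responses, rewards
```

With `max_turns=1` the function reduces to ordinary single-turn group sampling, which makes it easy to compare against a baseline that has no mistake-report feedback.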

Paper Structure (49 sections, 13 equations, 5 figures, 4 tables, 1 algorithm)


INTRODUCTION
METHODOLOGY
RLVR for MLLMs
RLVR's Components
RLVR Workflow: GRPO as an Example
(3) Advantage Construction.
(4) Knowledge Internalization.
Brief Summary
Information Bottlenecks in RLVR
Bottleneck I: Reachability of Positive Solutions
Bottleneck II: Identifiability of Correctness
ContextRL: Augmenting RLVR with Context
Context-Augmented Reward Model
Reward Context: Full Solution vs. Final Answer.
Context-Augmented Reward Model.
...and 34 more sections

Figures (5)

Figure 1: Overview of ContextRL. (a) Context-augmented reward model. (b) Context-augmented policy. (c) Training workflow.
Figure 2: False Positive Example (Hallucination).
Figure A-1: Regular reward instruction template.
Figure A-2: Context-augmented reward instruction template.
Figure A-3: False Positive Example (Reasoning Error).
