Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Shan Ning; Longtian Qiu; Xuming He

TL;DR

Wiki-R1 is a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. It introduces controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates.

Abstract

Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
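To make the notion of a "non-zero advantage" concrete, the following is a minimal sketch, not the paper's implementation. It assumes a group-relative advantage of the kind used in GRPO-style methods (each rollout's reward minus the group mean, divided by the group standard deviation) and a binary correctness reward. The function names, the `eps` constant, and the independence assumption are illustrative choices that do not come from the abstract.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps).

    Assumed GRPO-style normalization; the abstract does not give the formula.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_informative(rewards):
    """A group carries a training signal only if its rewards are not all equal."""
    return len(set(rewards)) > 1


def nonzero_advantage_probability(p, group_size):
    """Chance that a group of binary-reward rollouts is mixed.

    Assumes each rollout succeeds independently with probability p
    (a simplification made for this sketch).
    """
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


if __name__ == "__main__":
    print(group_advantages([1, 1, 1, 1]))   # every rollout correct: no signal
    print(group_advantages([1, 0, 1, 0]))   # mixed outcomes: useful signal
    for p in (0.05, 0.5, 0.95):
        print(p, nonzero_advantage_probability(p, group_size=8))
```

Under these assumptions, a question the model always gets right or always gets wrong contributes nothing to the update, while questions of intermediate difficulty are the most likely to produce mixed groups. This is one way to read why selecting samples by estimated difficulty could improve training efficiency.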

Paper Structure (52 sections, 6 equations, 5 figures, 8 tables, 2 algorithms)

Introduction
Related Work
Knowledge-based Visual Question Answering
Curriculum Learning for RL
Wiki-R1
Task Definition
Training Objective
Curriculum Data Generation
Controllable Data Generation
Gap-Level Schedule
Curriculum Sampling with Observation Propagation
Sampling Schedule
Difficulty Estimation via Observation Propagation
Experiments
Evaluation Benchmarks
...and 37 more sections

Figures (5)

Figure 1: (a) and (b): Training dynamics of DAPO on KB-VQA. RL optimization suffers from a high proportion of zero-advantage samples and low training accuracy, highlighting the distribution gap between pretraining and the KB-VQA target domain. (c): Motivation of Wiki-R1. To mitigate this gap, Wiki-R1 generates a sequence of training distributions with progressively reduced discrepancies and employs a curriculum sampling strategy to select informative samples.
Figure 2: Left: Controllable curriculum data generation. We manipulate the retriever to generate training samples with gradually increasing difficulty, adaptively aligned with the model’s evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. Right: Curriculum sampling with observation propagation. We adaptively select informative samples likely to produce non-zero advantage during RL updates, with sample difficulty estimated from observed rewards and propagated to unobserved examples.
Figure 3: Left: Number of ignored trajectories. Trajectories are ignored when they provide zero advantage and no training signal; a larger number indicates lower training efficiency. Right: Accuracy over training iterations. Performance is reported on the EVQA test set and the InfoSeek validation set. The star denotes an increase in curriculum difficulty during Wiki-R1 training.
Figure 4: Comparison across gap thresholds and smoothing factors. We report EVQA test and InfoSeek validation performance across training iterations under different hyperparameter settings. In the left two plots, the star denotes an increase in curriculum difficulty during Wiki-R1 training. The chosen values are $\tau = 0.55$ and $\alpha = 0.8$.
Figure 5: Performance over training iterations for three independent runs on EVQA and InfoSeek.
