Chart Deep Research in LVLMs via Parallel Relative Policy Optimization

Jiajin Tang; Gaoyang; Wenjie Wang; Sibei Yang; Xing Chen

TL;DR

The proposed PRPO and MCDR-Bench jointly establish a unified framework that systematically advances chart deep research through enhanced collaborative training and objective evaluation, enabling quantifiable evaluation of deep research capabilities.

Abstract

With the rapid advancement of data science, charts have evolved from simple numerical presentation tools to essential instruments for insight discovery and decision-making support. However, current chart data intelligence exhibits significant limitations in deep research capabilities, with existing methods predominantly addressing shallow tasks such as visual recognition or factual question-answering, rather than the complex reasoning and high-level data analysis that deep research requires. This limitation stems from two primary technical bottlenecks: at the training level, existing post-training techniques exhibit deficiencies in handling multi-dimensional reward signal interference and heterogeneous data gradient conflicts, preventing models from achieving balanced development across multiple capability dimensions; at the evaluation level, current methods remain limited to factual retrieval and basic computation, failing to assess end-to-end analytic reasoning and other deep research capabilities. To address the training challenge, we propose PRPO, which performs parallel optimization across reward dimensions and capability partitioning across data types, effectively disentangling conflicts between heterogeneous data and multi-dimensional reward signals while ensuring optimization stability. For the evaluation challenge, we construct MCDR-Bench based on the "error uniqueness principle," transforming subjective generation assessment into objective error identification through controllable error injection, enabling quantifiable evaluation of deep research capabilities. Experimental validation confirms that the proposed PRPO and MCDR-Bench jointly establish a unified framework that systematically advances chart deep research through enhanced collaborative training and objective evaluation.
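The abstract's idea of turning subjective report assessment into objective error identification can be illustrated with a small sketch. This is not the MCDR-Bench construction procedure, which is not described on this page. The names `ErrorItem`, `inject_error`, `score`, and the hand-written `corruptions` mapping are all hypothetical, and the only assumption taken from the abstract is that each item contains exactly one injected error, so a prediction can be scored by exact match.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ErrorItem:
    statements: List[str]  # report statements shown to the model
    error_index: int       # position of the single injected error


def inject_error(statements: List[str],
                 corruptions: Dict[int, str],
                 rng: random.Random) -> ErrorItem:
    """Replace exactly one statement with a corrupted variant.

    `corruptions` maps a statement index to a wrong version of that
    statement (hypothetical input; assumed to be prepared in advance).
    """
    idx = rng.choice(sorted(corruptions))
    corrupted = list(statements)  # copy so the clean report is untouched
    corrupted[idx] = corruptions[idx]
    return ErrorItem(corrupted, idx)


def score(predicted_index: int, item: ErrorItem) -> float:
    """Objective scoring: 1.0 only if the injected error was located."""
    return 1.0 if predicted_index == item.error_index else 0.0
```

Because exactly one statement is altered, scoring needs no judge model or rubric: the prediction either matches the injected position or it does not.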

Paper Structure (24 sections, 10 equations, 13 figures, 5 tables)

Introduction
Related Work
MCDR-Bench
Parallel Relative Policy Optimization
Preliminary
Reward Parallel Relative Policy Optimization
Data Parallel Relative Policy Optimization
Parallel Relative Policy Optimization
Experiments
Settings
MCDR-Bench Results
ChartQAPRO Results
Ablation Study
Conclusion
Reward Analysis
...and 9 more sections

Figures (13)

Figure 1: Optimization trajectories of GRPO vs. PRPO. Left: under multi-dimensional rewards, GRPO suffers from signal interference and limited exploration. Right: PRPO decomposes optimization across reward dimensions and data types, reducing interference and enabling specialized learning, which yields more effective exploration and near-optimal performance.
Figure 2: Multi-agent annotation process for multimodal chart deep research. The process consists of five stages: (1) Background Acquisition: retrieving domain-specific knowledge; (2) Fact Extraction: extracting atomic data elements; (3) Relationship Construction: modeling topological and logical connections; (4) Deep Research Report Generation: synthesizing comprehensive reports; (5) Forecast/Plan: proposing strategic recommendations. Human filtering ensures quality control throughout the process. The five specialized expert agents are represented by icons in the figure.
Figure 3: Demonstration of GRPO and our PRPO. PRPO unifies Reward-PRPO and Data-PRPO, partitioning data into capability-based groups and decomposing rewards across dimensions. This approach addresses multi-dimensional reward conflicts and data optimization conflicts, enabling balanced training across complex tasks.
Figure 4: Comparison of multi-dimensional reward processing between GRPO and PRPO. (1) Original multi-dimensional rewards for the same sample. (2a) GRPO aggregates rewards into a single scalar value, resulting in low variance and weak signal strength, which diminishes the distinct optimization signals from each reward dimension. (3a) GRPO's advantage calculation further exhibits weak discrimination due to the loss of variability in aggregated rewards. (2b) PRPO independently processes each reward dimension, preserving signal integrity. (3b) PRPO's dimension-specific advantage calculation achieves higher discrimination, effectively capturing the unique contributions of each reward type and enabling more effective optimization.
Figure 5: Comparison of reward values (left) and response lengths (right) during training for Qwen2.5-VL-3B (top) and Qwen2.5-VL-7B (bottom) models under different reward strategies.
...and 8 more figures
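The contrast described in the Figure 4 caption can be sketched numerically. The group-relative normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO-style advantage. Applying it separately to each reward dimension is an assumption based on the caption, not the paper's exact PRPO formulation, and how the per-dimension advantages enter the loss is not shown here. The function names and the `eps` constant are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_relative(values: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize values within one sampled group (population std)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / (s + eps) for v in values]


def aggregated_advantages(rewards: List[List[float]]) -> List[float]:
    """GRPO-style: sum the reward dimensions to a scalar, then normalize."""
    totals = [sum(r) for r in rewards]
    return group_relative(totals)


def per_dimension_advantages(rewards: List[List[float]]) -> List[List[float]]:
    """Assumed PRPO-style step: normalize each reward dimension on its own.

    Returns one advantage vector per response, with one entry per dimension.
    """
    per_dim = [group_relative(list(d)) for d in zip(*rewards)]
    return [list(col) for col in zip(*per_dim)]


# Two responses, two reward dimensions (e.g. accuracy and format).
rewards = [[1.0, 0.0], [0.0, 1.0]]
print(aggregated_advantages(rewards))     # both totals are equal
print(per_dimension_advantages(rewards))  # dimensions stay distinguishable
```

In this toy group the two responses have the same summed reward, so the aggregated advantages are zero and carry no learning signal, while the per-dimension advantages still separate which response did better on which dimension.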
