Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Weijie Qiu; Dai Guan; Junxin Wang; Zhihang Li; Yongbo Gai; Mengyu Zhou; Erchao Zhao; Xiaoxi Jiang; Guanjun Jiang

Abstract

Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into reinforcement learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming methods trained on four times the data. Ablations show that Proxy-SFT is a stronger verifier than Proxy-RL, and that implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.
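
The rubric-quality reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. All names (`PreferenceSample`, `proxy_rubric_reward`, `toy_keyword_proxy`) are hypothetical, the image input is omitted, and the keyword-overlap proxy is only a stand-in for a trained Proxy-SFT or Proxy-RL agent. The binary 0/1 reward is an assumption about how "prediction accuracy" is scored for a single rubric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PreferenceSample:
    """One pairwise preference example (image omitted in this toy sketch)."""
    query: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B"


# Hypothetical proxy interface: (rubric, sample) -> predicted preference "A" or "B".
ProxyFn = Callable[[str, PreferenceSample], str]


def proxy_rubric_reward(rubric: str, sample: PreferenceSample, proxy: ProxyFn) -> float:
    """Return 1.0 if the proxy, guided by the rubric, recovers the labeled preference.

    Assumption: a binary reward per rubric; the paper may score this differently.
    """
    prediction = proxy(rubric, sample)
    return 1.0 if prediction == sample.preferred else 0.0


def toy_keyword_proxy(rubric: str, sample: PreferenceSample) -> str:
    """Stand-in for a trained proxy: prefer the response sharing more words with the rubric."""
    rubric_words = set(rubric.lower().split())
    overlap_a = len(rubric_words & set(sample.response_a.lower().split()))
    overlap_b = len(rubric_words & set(sample.response_b.lower().split()))
    return "A" if overlap_a >= overlap_b else "B"


if __name__ == "__main__":
    sample = PreferenceSample(
        query="What animal is in the image?",
        response_a="A brown dog sitting on grass",
        response_b="An animal",
        preferred="A",
    )
    # A rubric that points at the right evidence earns the reward; a misleading one does not.
    for rubric in ("mentions the dog and the grass",
                   "prefers the shorter answer naming an animal"):
        print(rubric, "->", proxy_rubric_reward(rubric, sample, toy_keyword_proxy))
```

The design point the sketch tries to capture is that the reward depends on whether an independent reader of the rubric reaches the labeled preference, not on the policy's own final verdict.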

Paper Structure (37 sections, 9 equations, 10 figures, 9 tables)

Introduction
Related Work
Reward Models for Vision-Language Models
Rubric-Based Evaluation and Reward Modeling
Reinforcement Learning for Language and Vision-Language Models
Process Supervision and Verification
Methods
Problem Formulation
Proxy Agent Training
Policy Model Training
Stage 1: Cold-Start Supervised Fine-Tuning.
Stage 2: Proxy-Guided Reinforcement Learning.
Inference with Proxy Verification
Experiments
Implementation Details
...and 22 more sections

Figures (10)

Figure 1: Overview of rubric-quality reward modeling via a proxy. Top-left: Traditional GRM training uses reward only from the final answer; rubric quality is implicit/latent, making the supervision a black box and allowing the model to guess correct answers without reliable guidance. Top-right: Our framework explicitly generates a transferable rubric alongside the final answer and optimizes the GRM with a rubric-quality reward computed by a proxy model that predicts preference given the question and generated rubric; correct proxy predictions provide strong guidance to improve rubric quality. Bottom: Data-efficiency comparison on VL-RewardBench, MultiModal Reward Bench, and MM-RLHF-Reward Bench shows our method (1$\times$ data) outperforming baselines trained with $\geq$4$\times$ more data.
Figure 2: Training pipeline for Proxy-GRM with transferable rubrics. (A) Data & distillation. We merge multiple VLM preference datasets and apply automated filters to obtain ${\sim}60\text{k}$ samples. A teacher then separates the data into Correct and Hard subsets, which are allocated to train the proxy agent and to cold-start / RL-train Proxy-GRM. (B) Proxy agent training. Starting from Qwen2.5-VL-7B-Instruct, we train a proxy agent via Proxy-SFT followed by Proxy-RL, producing an evaluator that provides rubric-quality assessment signals. (C) Proxy-GRM training. We first cold-start the GRM policy with SFT. We then perform reinforcement learning with accuracy, proxy (rubric-quality), and format rewards, and optimize with GRPO updates. (D) Transferable rubric. A rubric is transferable if an independent agent can follow it to select the correct answer; otherwise, it is non-transferable and may lead to wrong decisions despite appearing plausible.
Figure 3: Qualitative comparison of rubrics generated under different proxy agent configurations. Given the same multimodal input with a pairwise preference (r1 is preferred), we compare four variants: RL (no proxy) produces rubrics with redundant criteria and an incorrect verdict; RL+Qwen2.5-VL-3B fails to correct repetition and distorts weight allocation, also yielding an incorrect answer; RL+Qwen2.5-VL-32B introduces a more informative criterion (Contextual Awareness) and arrives at the correct verdict; RL+Proxy-SFT generates the most discriminative rubric with Contextual Relevance, balanced weights, and a correct prediction. Stronger proxy agents guide the policy toward more specific and non-redundant rubrics.
Figure 4: Comparison of average scores across different proxy reward feedback variants on three benchmarks. Each subplot reports the mean score for each variant, with numeric annotations above bars. Triangular markers indicate the best and worst performing variants within each benchmark. Overall, the base model achieves the highest scores on VL-RewardBench and Instructability, while performance on MM-RLHF-RewardBench remains consistently high with smaller variation across variants.
Figure 5: System Prompt for Distillation and RL.
...and 5 more figures
