Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Abstract

Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into reinforcement learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming methods trained on four times the data. Ablations show that Proxy-SFT is a stronger verifier than Proxy-RL and that implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.
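The rubric-quality reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PreferenceSample`, `rubric_quality_reward`, and `keyword_proxy` are hypothetical, the image input is omitted, and the toy keyword-overlap proxy merely stands in for a trained Proxy-SFT or Proxy-RL model. The sketch assumes a binary reward of 1 when the proxy, guided only by the rubric, recovers the labeled preference.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferenceSample:
    """Hypothetical container for one preference pair (image omitted)."""
    query: str
    response_1: str
    response_2: str
    preferred: int  # ground-truth label: 1 or 2


# A proxy maps (rubric, sample) to the index of the response it prefers.
Proxy = Callable[[str, PreferenceSample], int]


def rubric_quality_reward(rubric: str, sample: PreferenceSample, proxy: Proxy) -> float:
    """Assumed binary reward: 1.0 if the proxy recovers the label, else 0.0."""
    return 1.0 if proxy(rubric, sample) == sample.preferred else 0.0


def keyword_proxy(rubric: str, sample: PreferenceSample) -> int:
    """Toy stand-in for a trained proxy: picks the response that shares
    more words with the rubric (ties go to response 1)."""
    terms = set(rubric.lower().split())

    def overlap(text: str) -> int:
        return len(terms & set(text.lower().split()))

    return 1 if overlap(sample.response_1) >= overlap(sample.response_2) else 2


sample = PreferenceSample(
    query="What colour is the car in the image?",
    response_1="The car is red and parked near a tree",
    response_2="It is a nice day",
    preferred=1,
)

good_rubric = "mentions the car colour"
bad_rubric = "praises a nice day"

good_reward = rubric_quality_reward(good_rubric, sample, keyword_proxy)
bad_reward = rubric_quality_reward(bad_rubric, sample, keyword_proxy)
```

In this toy setting, a rubric that points the proxy toward the labeled response earns the reward and a misleading rubric does not, which mirrors the sense in which a rubric is "transferable" to an independent evaluator.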

Paper Structure (37 sections, 9 equations, 10 figures, 9 tables)

This paper contains 37 sections, 9 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Overview of rubric-quality reward modeling via a proxy. Top-left: Traditional GRM training uses reward only from the final answer; rubric quality is implicit/latent, making the supervision a black box and allowing the model to guess correct answers without reliable guidance. Top-right: Our framework explicitly generates a transferable rubric alongside the final answer and optimizes the GRM with a rubric-quality reward computed by a proxy model that predicts preference given the question and generated rubric; correct proxy predictions provide strong guidance to improve rubric quality. Bottom: Data-efficiency comparison on VL-RewardBench, MultiModal Reward Bench, and MM-RLHF-Reward Bench shows our method (1$\times$ data) outperforming baselines trained with $\geq$4$\times$ more data.
  • Figure 2: Training pipeline for Proxy-GRM with transferable rubrics. (A) Data & distillation. We merge multiple VLM preference datasets and apply automated filters to obtain ${\sim}60\text{k}$ samples. A teacher then separates the data into Correct and Hard subsets, which are allocated to train the proxy agent and to cold-start / RL-train Proxy-GRM. (B) Proxy agent training. Starting from Qwen2.5-VL-7B-Instruct, we train a proxy agent via Proxy-SFT followed by Proxy-RL, producing an evaluator that provides rubric-quality assessment signals. (C) Proxy-GRM training. We first cold-start the GRM policy with SFT. We then perform reinforcement learning with accuracy, proxy (rubric-quality), and format rewards, and optimize with GRPO updates. (D) Transferable rubric. A rubric is transferable if an independent agent can follow it to select the correct answer; otherwise, it is non-transferable and may lead to wrong decisions despite appearing plausible.
  • Figure 3: Qualitative comparison of rubrics generated under different proxy agent configurations. Given the same multimodal input with a pairwise preference (r1 is preferred), we compare four variants: RL (no proxy) produces rubrics with redundant criteria and an incorrect verdict; RL+Qwen2.5-VL-3B fails to correct repetition and distorts weight allocation, also yielding an incorrect answer; RL+Qwen2.5-VL-32B introduces a more informative criterion (Contextual Awareness) and arrives at the correct verdict; RL+Proxy-SFT generates the most discriminative rubric with Contextual Relevance, balanced weights, and a correct prediction. Stronger proxy agents guide the policy toward more specific and non-redundant rubrics.
  • Figure 4: Comparison of average scores across different proxy reward feedback variants on three benchmarks. Each subplot reports the mean score for each variant, with numeric annotations above the bars. Triangular markers indicate the best- and worst-performing variants within each benchmark. Overall, the base model achieves the highest scores on VL-RewardBench and Instructability, while performance on MM-RLHF-RewardBench remains consistently high with smaller variation across variants.
  • Figure 5: System Prompt for Distillation and RL.
  • ...and 5 more figures