Visual-ERM: Reward Modeling for Visual Equivalence

Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

Abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards rely either on textual rules or on coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose the Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7 and +4.1 on average, respectively), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and effective for vision-to-code RL, regardless of task specificity.
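
To make the idea of scoring "directly in the rendered visual space" concrete, here is a minimal sketch of how a render-then-judge reward might be wired up. It is not the authors' implementation: the `Discrepancy` fields, the 1 to 3 severity scale, the `max_penalty` normalization, and the `render` and `judge` callables are all hypothetical stand-ins, since the abstract does not specify how Visual-ERM's feedback is converted into a scalar.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Discrepancy:
    """One fine-grained difference reported by a judge (hypothetical schema)."""
    category: str     # e.g. "text", "layout", "color" (assumed labels)
    severity: int     # assumed scale: 1 (minor) to 3 (severe)
    description: str  # human-readable explanation


def discrepancy_reward(discrepancies: List[Discrepancy],
                       max_penalty: float = 10.0) -> float:
    """Map a list of discrepancies to a reward in [0, 1].

    Assumption: the penalty is the sum of severities, normalized by
    max_penalty and clipped at zero. Fewer or milder discrepancies
    give a reward closer to 1.
    """
    penalty = sum(d.severity for d in discrepancies)
    return max(0.0, 1.0 - penalty / max_penalty)


def visual_reward(reference_image: Any,
                  predicted_code: str,
                  render: Callable[[str], Any],
                  judge: Callable[[Any, Any], List[Discrepancy]],
                  max_penalty: float = 10.0) -> float:
    """Render the prediction, compare it with the reference, return a reward.

    `render` and `judge` are placeholders for a renderer and a reward
    model; code that fails to render receives zero reward.
    """
    try:
        rendered = render(predicted_code)
    except Exception:
        return 0.0
    return discrepancy_reward(judge(reference_image, rendered), max_penalty)
```

Under these assumptions, a prediction that renders identically to the reference yields no discrepancies and a reward of 1, while unrenderable output is assigned 0 so that the policy cannot score well with text that merely looks plausible.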

Paper Structure

This paper contains 60 sections, 12 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Overview of vision-to-code reward modeling and Visual-ERM. (a) Vision-to-code transforms structured visual inputs (charts, tables, and SVGs) into structured textual outputs (code or markup). Rendering the predicted text back into an image enables evaluation in the visual space. (b) Prior rewards either rely on text-based rule metrics, which ignore critical visual cues, or use vision-encoder feature similarity, which is coarse-grained and lacks interpretability. (c) Visual-ERM provides fine-grained, interpretable, and task-agnostic reward signals for vision-to-code, serving as a reliable supervisor in both RL pipelines and test-time scaling.
  • Figure 2: (a) Training Data Generation. Starting from raw images and associated text, we construct image pairs by (i) synthesizing negative samples via targeted edits that inject pre-defined error types, and (ii) sampling naturally occurring errors from direct model inference. We then obtain fine-grained annotations for each image pair using proprietary models. (b) Visual-ERM. We train Visual-ERM on the resulting data and integrate it into both RL training and test-time scaling.
  • Figure 3: VC-RewardBench. We construct VC-RewardBench by first leveraging several advanced proprietary models for preliminary annotation, followed by manual consolidation and filtering, resulting in 1,335 high-quality instances.
  • Figure 4: DINO vs. Visual-ERM. DINO scores are semantically biased, which can lead to inflated rewards and reward hacking. In contrast, Visual-ERM accounts for fine-grained visual details and provides more precise and interpretable evaluations.
  • Figure 5: Test-Time Scaling Cases. Leveraging Visual-ERM's fine-grained feedback, we enable an inference, reflection, and refinement loop at test time. Predictions are generated as text and rendered as images for visualization.
  • ...and 9 more figures
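
The inference, reflection, and refinement loop mentioned in the Figure 5 caption can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the paper's procedure: `generate`, `revise`, `render`, and `judge` are hypothetical callables standing in for the policy model, its revision step, a renderer, and a reward model that returns a list of discrepancies, and the stopping rule (stop when no discrepancies remain or after `max_rounds`) is assumed.

```python
from typing import Any, Callable, List


def refine(reference_image: Any,
           generate: Callable[[Any], str],
           revise: Callable[[Any, str, List[Any]], str],
           render: Callable[[str], Any],
           judge: Callable[[Any, Any], List[Any]],
           max_rounds: int = 3) -> str:
    """Generate code, then repeatedly revise it using judge feedback.

    Each round renders the current code, asks the judge for a list of
    discrepancies against the reference, and stops early when the list
    is empty. All callables are placeholders supplied by the caller.
    """
    code = generate(reference_image)
    for _ in range(max_rounds):
        feedback = judge(reference_image, render(code))
        if not feedback:
            break  # no remaining discrepancies reported
        code = revise(reference_image, code, feedback)
    return code
```

A toy usage with strings in place of images shows the control flow: the "judge" lists characters missing from the rendering, and the "reviser" fixes one per round.

```python
result = refine(
    "abc",
    generate=lambda ref: "a",
    revise=lambda ref, code, fb: code + fb[0],
    render=lambda code: code,
    judge=lambda ref, img: [c for c in ref if c not in img],
)
```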