Vero: An Open RL Recipe for General Visual Reasoning

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

Abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.6-5.3 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.

Paper Structure

This paper contains 98 sections, 12 equations, 17 figures, and 11 tables.

Figures (17)

  • Figure 1: Vero achieves state-of-the-art performance across six task categories using a fully open RL recipe. Top left: Training curves versus RL FLOPs for Vero-600K compared with existing open RL datasets, all finetuned from Qwen2.5-VL-7B-Instruct; dashed lines indicate training beyond one epoch. Top center: Summary of the broad task categories targeted in Vero-600K and VeroEval. Top right: Vero compared to Qwen3-VL-Instruct and Qwen3-VL-Thinking overall on the 30 benchmarks in VeroEval. Bottom row: Same as top right, but displaying per-category scores.
  • Figure 2: Composition of Vero-600K. The inner ring shows six task categories, each allocated 100K samples (600K total), and the outer ring shows their 59 constituent datasets. The categories represent real-world use cases and cover distinct visual reasoning capabilities (Sections \ref{sec:cross_gen} and \ref{sec:behavioral-analysis}). Categories are uniformly sampled to balance learning across tasks (see the sampling sketch after this list).
  • Figure 3: Vero-600K data curation pipeline. Starting from over 250 candidate datasets, we assign each to one of six task categories and apply multi-stage selection and filtering: heuristic screening (size, resolution, answer format), manual quality control, LLM-based question filtering for ambiguity and verifiability, and answer filtering for stable reward computation. The retained data are combined into a uniformly weighted mixture across task categories and used for on-policy RL training with task-routed rewards.
  • Figure 4: Examples from each task category illustrate the breadth of Vero-600K. We show representative samples of our training data from the six categories, highlighting the diversity of visual inputs, question formats, and answer types covered by our training set.
  • Figure 5: Accuracy verifiers for our multi-task reward. Each card illustrates one of eight verifiers used in Vero, with an example visual question, ground-truth answer, and model prediction. Verifiers include binary rewards (string match, numeric via math-verify, ordering, point-in-box) and graded rewards (IoU/F1 for grounding, field match for web actions, rule-based checks for instruction following, and LLM-as-judge for open-ended responses). This task-routed design enables accurate reward computation across diverse answer formats (a minimal routing sketch follows this list).
  • ...and 12 more figures
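
The curation pipeline in Figure 3 ends in a uniformly weighted mixture over the six task categories, and Figure 2 notes that categories are sampled uniformly to balance learning across tasks. Below is a minimal sketch of such a sampler; the category tags and pool layout are our own assumptions for illustration, not the released Vero code.

```python
import random

# Hypothetical category tags; the actual Vero-600K taxonomy may differ.
CATEGORIES = ["charts", "documents", "science", "spatial", "grounding", "open_ended"]

def sample_batch(pools: dict[str, list[dict]], batch_size: int,
                 rng: random.Random) -> list[dict]:
    """Draw an RL batch with uniform probability over task categories,
    so no single category dominates training regardless of pool size."""
    batch = []
    for _ in range(batch_size):
        category = rng.choice(CATEGORIES)          # uniform over categories
        batch.append(rng.choice(pools[category]))  # uniform within a category
    return batch
```

Drawing the category first decouples a category's weight in training from how many raw samples it contributes, which is what keeps the mixture balanced across tasks of very different dataset sizes.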
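
Figure 5's task-routed rewards pair each sample's answer format with a matching verifier. The sketch below illustrates the routing idea with two of the simpler verifiers; the task tags, function names, and registry are hypothetical stand-ins, and the full recipe also covers numeric checking (e.g., via math-verify), ordering, point-in-box, field match, rule-based instruction checks, and LLM-as-judge.

```python
from typing import Callable

def string_match(pred: str, gold: str) -> float:
    """Binary reward: 1.0 on normalized exact match, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def iou_reward(pred_box: list[float], gold_box: list[float]) -> float:
    """Graded reward: intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(pred_box[0], gold_box[0]), max(pred_box[1], gold_box[1])
    ix2, iy2 = min(pred_box[2], gold_box[2]), min(pred_box[3], gold_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred_box) + area(gold_box) - inter
    return inter / union if union > 0 else 0.0

# Each training sample carries a task tag that routes it to its verifier.
VERIFIERS: dict[str, Callable[..., float]] = {
    "string_match": string_match,
    "grounding": iou_reward,
    # ... plus numeric, ordering, point-in-box, field-match,
    # rule-based, and LLM-as-judge verifiers in the same shape.
}

def compute_reward(sample: dict, prediction) -> float:
    """Dispatch the prediction to the verifier for this sample's task type."""
    return VERIFIERS[sample["task_type"]](prediction, sample["answer"])
```

Keeping every verifier behind the same (prediction, answer) -> float interface is what lets a single RL loop train across heterogeneous answer formats.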