VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing

Juan Rodriguez, Haotian Zhang, Abhay Puri, Tianyang Zhang, Rishav Pramanik, Meng Lin, Xiaoqing Xie, Marco Terral, Darsh Kaushik, Aly Shariff, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli

Abstract

We introduce VectorGym, a comprehensive benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, complex editing, and visual understanding. VectorGym addresses the lack of realistic, challenging benchmarks aligned with professional design workflows. Our benchmark comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch); a new SVG editing dataset (VG-Edit) featuring complex, multi-step edits with higher-order primitives; Text2SVG generation (VG-Text); and SVG captioning (VG-Cap). Unlike prior benchmarks that rely on synthetic edits, VectorGym provides gold-standard human annotations that require semantic understanding and design intent. We also propose a multi-task reinforcement learning approach that jointly optimizes across all four tasks using rendering-based rewards. Our method, built on GRPO with curriculum learning, trains a Qwen3-VL 8B model that achieves state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o. We also introduce a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies. Our evaluation of frontier VLMs reveals significant performance gaps, positioning VectorGym as a rigorous framework for advancing visual code generation. VectorGym is publicly available on huggingface.co/datasets/ServiceNow/VectorGym.
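The abstract's training recipe pairs GRPO-style group-relative optimization with rewards computed by rendering the generated SVG. As a rough sketch of how such a reward could be wired up, the snippet below rasterizes a candidate SVG, scores it against a target raster, and normalizes rewards within a group of rollouts; the choice of cairosvg, the pixel-MAE reward, and the group normalization are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a rendering-based reward with GRPO-style group
# normalization. Assumes cairosvg, numpy, and Pillow; the pixel-MAE
# reward is an illustrative stand-in for the paper's reward design.
import io
import numpy as np
import cairosvg
from PIL import Image

def render_svg(svg_code: str, size: int = 256):
    """Rasterize SVG code to an RGB float array in [0, 1]; None if invalid."""
    try:
        png = cairosvg.svg2png(bytestring=svg_code.encode("utf-8"),
                               output_width=size, output_height=size)
    except Exception:
        return None  # malformed or unrenderable SVG earns zero reward
    img = Image.open(io.BytesIO(png)).convert("RGB")
    return np.asarray(img, dtype=np.float32) / 255.0

def rendering_reward(svg_code: str, target: np.ndarray) -> float:
    """Reward in [0, 1]: pixel agreement between the rendered SVG and target."""
    rendered = render_svg(svg_code, size=target.shape[0])
    if rendered is None:
        return 0.0
    return 1.0 - float(np.abs(rendered - target).mean())

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: z-score rewards across rollouts of one prompt."""
    r = np.asarray(rewards, dtype=np.float32)
    if r.std() < 1e-6:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return ((r - r.mean()) / r.std()).tolist()
```

The VLM-as-a-Judge metric described in the abstract can be approximated in the same spirit: render the SVG and ask a vision-language model for a 0-5 rating against the text description. The sketch below assumes an OpenAI-compatible chat API with image inputs; the judge model name, prompt wording, and single-integer reply format are assumptions for illustration, not the paper's validated metric.

```python
# Hypothetical VLM-as-a-Judge scorer on a 0-5 scale, assuming the
# OpenAI Python client; not the paper's exact prompt or judge model.
import base64
from openai import OpenAI

client = OpenAI()

def judge_svg(rendered_png: bytes, description: str) -> int:
    """Ask a VLM how well a rendered SVG matches its text description (0-5)."""
    b64 = base64.b64encode(rendered_png).decode("ascii")
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable VLM judge
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Rate from 0 to 5 how well this image matches the "
                         f"description: '{description}'. Reply with one integer."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return int(resp.choices[0].message.content.strip())
```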

Paper Structure

This paper contains 32 sections, 3 equations, 11 figures, and 10 tables.

Figures (11)

  • Figure 2: Visualization of VectorGym Test Examples (Editing Task). We randomly sample 21 examples from the test set and show the editing instruction along with the source and target SVGs. These examples are part of VG-Edit.
  • Figure 3: Qualitative results on VectorGym. We display VLM-Judge and Human scores on a scale from 0 to 5. For each task, we show three validation samples alongside outputs from the strongest models in our evaluation. Human ratings tend to be stricter, while VLM judges are more permissive and often cluster around mid-range values when uncertain.
  • Figure 4: VG-Sketch Qualitative Results. The leftmost column displays the input raster sketch, followed by the outputs from top-performing models. Gemini 3 Pro demonstrates superior fidelity in preserving topological structure compared to GPT-5.1 and others.
  • Figure 5: VG-Edit Qualitative Results. Left to right: natural language edit instruction, input SVG, and model outputs. Gemini 3 Pro, Claude 4.5 Sonnet, and GPT-5.1 effectively execute complex semantic modifications, whereas our trained models struggle to follow some multi-step edits.
  • Figure 6: Visualization of VG-Sketch Test Examples. We randomly sample 30 examples and show each sketch alongside its target vector graphic.
  • ...and 6 more figures