VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang

TL;DR

This paper introduces VCode, a benchmark that treats multimodal understanding as SVG code generation from natural images, enabling downstream reasoning over rendered vector graphics. It defines CodeVQA to assess whether the rendered SVG preserves the image's symbolic meaning, and introduces VCoder, a two-pronged approach combining Thinking with Revision and Acting with Visual Tools to improve visual-centric coding. Across MM-Vet, MMMU, and CV-Bench, VCoder achieves an overall gain of 12.3 points over the top-performing Claude-4-Opus, while frontier models still lag behind the original-image upper bound, highlighting the gap between language-centric and visual-centric coding. The work suggests that symbolic visual representations can better support reasoning and agentic tasks, with practical implications for interpretable multimodal AI and future end-to-end vision–language coding systems.

Abstract

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
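
The CodeVQA protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` fields and the `coder`, `render`, `policy`, and `is_correct` callables are hypothetical placeholders standing in for the VLM coder, an SVG rasterizer, the policy model, and an answer-matching rule.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    """One benchmark item (field names are assumptions for illustration)."""
    image: str     # identifier or path of the original image
    question: str  # question asked about the image
    answer: str    # reference answer


def codevqa_accuracy(
    samples: Iterable[Sample],
    coder: Callable[[str], str],              # image -> SVG source
    render: Callable[[str], str],             # SVG source -> rendered image
    policy: Callable[[str, str], str],        # (rendered image, question) -> answer
    is_correct: Callable[[str, str], bool],   # (prediction, reference) -> bool
) -> float:
    """Fraction of questions answered correctly from the rendered SVG alone."""
    total = 0
    correct = 0
    for sample in samples:
        svg = coder(sample.image)        # the model under test writes SVG
        rendered = render(svg)           # the policy never sees the original image
        prediction = policy(rendered, sample.question)
        correct += int(is_correct(prediction, sample.answer))
        total += 1
    return correct / total if total else 0.0
```

A correct answer from the policy is taken as evidence that the SVG kept the symbolic content needed for that question, so the returned accuracy serves as a proxy for symbolic fidelity.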

Paper Structure

This paper contains 34 sections, 3 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of VCode. An RGB image (left, represented by pixels) is translated into symbolic SVG code (middle) via VLM as Coder and rendered back into an image (right, represented by code), aiming to preserve symbolic meaning (e.g., “three sheep on the farm”). As shown at the bottom, VCode provides a compact, interpretable, and executable representation of original images.
  • Figure 2: Left: Distributions of tasks in VCode, showing the proportions of general, professional, and vision-centric categories. Right: Illustration of the CodeVQA prototype: given an image and a question (e.g., “What is the lamp on, a side table or a nightstand?”), the policy model answers based on the rendered image. A correct answer indicates that the SVG representation preserves the semantic content of the original image, while an incorrect answer highlights room for improvement.
  • Figure 3: Augmenting Coders with Test-time Revision & Visual Tools. Left: Thinking with Revision: the model performs initial coding, comments on discrepancies between the original and rendered images, and iteratively refines the SVG code. Right: Acting with Visual Tools: external modules provide cues on categories, locations, shapes, colors, and text, which are translated into structured code signals to guide generation. These techniques yield more faithful and accurate renderings.
  • Figure 3: Effects of the vision tool modules, where Loc. denotes Location, C. denotes Category, and S. denotes Shape.
  • Figure 4: Effects of different input modes for Claude-4-Opus.
  • ...and 2 more figures
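
The two mechanisms named in the caption of the first Figure 3 entry can be sketched as below. This is an illustrative outline under stated assumptions, not the paper's code: the `Cue` fields, `cue_to_svg`, `revise_svg`, and the `coder`, `render`, `critic`, and `refine` callables are all hypothetical names, and the fixed `max_rounds` budget is an assumed stopping rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Cue:
    """A structured cue from an external tool (fields are assumptions)."""
    category: str                    # e.g. a detector label
    bbox: Tuple[int, int, int, int]  # x, y, width, height in pixels
    color: str                       # fill color name or hex string


def cue_to_svg(cue: Cue) -> str:
    """Translate one tool cue into an SVG element the coder can start from."""
    x, y, w, h = cue.bbox
    return (
        f'<rect x="{x}" y="{y}" width="{w}" height="{h}" '
        f"fill={quoteattr(cue.color)}>"
        f"<title>{escape(cue.category)}</title></rect>"
    )


def revise_svg(
    image: str,
    coder: Callable[[str], str],              # image -> initial SVG
    render: Callable[[str], str],             # SVG -> rendered image
    critic: Callable[[str, str], List[str]],  # (original, rendered) -> comments
    refine: Callable[[str, List[str]], str],  # (SVG, comments) -> revised SVG
    max_rounds: int = 3,                      # assumed revision budget
) -> str:
    """Initial coding, then comment-and-refine until no discrepancies remain."""
    svg = coder(image)
    for _ in range(max_rounds):
        comments = critic(image, render(svg))
        if not comments:  # nothing left to fix, stop early
            break
        svg = refine(svg, comments)
    return svg
```

In this sketch the cue elements would be passed to the coder as part of its prompt, while the revision loop runs at test time on the coder's output; how the two are combined in practice is left open here.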