InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang

TL;DR

The paper tackles fragmentation and transferability challenges in SVG modeling by introducing SAgoge, a massive multimodal SVG dataset, and SArena, a standardized benchmark. It then proposes InternSVG, a unified multimodal LLM that uses SVG-specific tokens and a two-stage curriculum to jointly learn SVG understanding, editing, and generation. Across SArena and previous benchmarks, InternSVG delivers substantial gains over both open-source and proprietary baselines, demonstrating positive transfer across tasks and domains. The work advances scalable, unified vector-graphic reasoning with practical implications for web design, visualization, and CAD-oriented workflows.

Abstract

General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
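
The abstract mentions SVG-specific special tokens with subword-based embedding initialization but does not spell out the procedure. One plausible reading, sketched below purely as an assumption, is that each new special token starts from the mean of the embeddings of the subword pieces it would otherwise be split into. The function name `init_new_token_embedding`, the toy vocabulary, and the two-dimensional vectors are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def init_new_token_embedding(
    pieces: List[str], embeddings: Dict[str, List[float]]
) -> List[float]:
    """Average the embeddings of the subword pieces of a new token.

    Assumption: mean pooling is used; the paper may combine pieces differently.
    """
    if not pieces:
        raise ValueError("a new token needs at least one subword piece")
    dim = len(embeddings[pieces[0]])
    total = [0.0] * dim
    for piece in pieces:
        for i, value in enumerate(embeddings[piece]):
            total[i] += value
    return [value / len(pieces) for value in total]


# Hypothetical toy embedding table for two existing subword pieces.
toy_embeddings = {"<": [1.0, 0.0], "path": [3.0, 2.0]}

# Hypothetical new special token "<path", previously split into "<" + "path".
new_vector = init_new_token_embedding(["<", "path"], toy_embeddings)
```

Starting a new token near the pieces it replaces keeps its initial representation close to what the model already saw for the same text, which is one intuition for why such an initialization might train more smoothly than a random one.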

Paper Structure

This paper contains 36 sections, 1 equation, 26 figures, 18 tables.

Figures (26)

  • Figure 1: Overview of our InternSVG family. SAgoge provides large-scale and diverse SVG samples across multiple domains. SArena enables comprehensive assessment of existing MLLMs on SVG tasks. InternSVG supports unified modeling for SVG understanding, editing, and generation.
  • Figure 2: Overview of the dataset construction pipeline. Raw SVGs are gathered from the web and a custom synthesis pipeline, then normalized to a $128\times128$ canvas and simplified to shorten code. The rendered images or videos, processed SVG code, and handcrafted prompts are fed to an MLLM to synthesize high-quality training samples for understanding, editing, and generation.
  • Figure 3: (a) Overall architecture of InternSVG. (b) Distribution of the number of tokens per SVG before and after adding customized special tokens in the tokenizer. (c) Comparison of training loss curves between subword-based embedding initialization and random initialization.
  • Figure 4: Visualization of SVG samples generated by InternSVG.
  • Figure 5: Qualitative comparison of Text-to-SVG performance between baselines and InternSVG on SArena-Icon. Red cross icons denote cases where the model failed to generate a valid SVG output.
  • ...and 21 more figures
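
The Figure 2 caption says raw SVGs are normalized to a $128\times128$ canvas, without giving the exact procedure. A minimal sketch of one way to do this follows, under stated assumptions: the input has a `viewBox`, the content is scaled uniformly by its larger side, and the original children are wrapped in a single `<g>` with a transform rather than having their coordinates rewritten. The function name `normalize_canvas` is hypothetical, and the authors' pipeline may work differently.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialize SVG elements without an "ns0:" prefix.
ET.register_namespace("", SVG_NS)


def normalize_canvas(svg_text: str, size: int = 128) -> str:
    """Rescale an SVG so its content fits a size x size canvas (sketch)."""
    root = ET.fromstring(svg_text)
    view_box = root.get("viewBox")
    if view_box is None:
        raise ValueError("this sketch assumes the SVG declares a viewBox")
    min_x, min_y, width, height = (
        float(v) for v in view_box.replace(",", " ").split()
    )
    # Uniform scale so the larger side maps to `size` (aspect ratio kept).
    scale = size / max(width, height)

    # Reuse the root's namespace (if any) for the wrapping group.
    group = ET.Element(root.tag[: -len("svg")] + "g")
    # Applied right to left: shift the origin to (0, 0), then scale.
    group.set(
        "transform",
        f"scale({scale:g}) translate({0.0 - min_x:g} {0.0 - min_y:g})",
    )
    for child in list(root):
        root.remove(child)
        group.append(child)
    root.append(group)

    root.set("viewBox", f"0 0 {size} {size}")
    root.set("width", str(size))
    root.set("height", str(size))
    return ET.tostring(root, encoding="unicode")


example = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 256 256">'
    '<rect x="0" y="0" width="256" height="256"/></svg>'
)
normalized = normalize_canvas(example)
```

Wrapping in a group keeps the sketch short, but it adds tokens to the code. A pipeline that also aims to shorten SVG code, as the caption describes, would more likely bake the scale into the coordinates themselves.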