
OmniSVG: A Unified Scalable Vector Graphics Generation Model

Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang

TL;DR

OmniSVG introduces a unified multimodal framework that leverages pre-trained Vision-Language Models to generate complex, editable SVGs by tokenizing SVG commands and coordinates. It addresses coordinate hallucination and scalability through discrete SVG tokens and end-to-end training conditioned on multimodal prompts. The authors provide MMSVG-2M, a two-million-sample dataset, and MMSVG-Bench, a standardized evaluation protocol for Text-to-SVG and Image-to-SVG tasks. Empirical results show OmniSVG outperforming prior methods in both quality and efficiency, supported by qualitative examples and ablations. This work offers a practical path for integrating high-fidelity SVG synthesis into professional design workflows.

Abstract

Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs at huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structures. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.

Paper Structure

This paper contains 32 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: OmniSVG is capable of autoregressively generating high-quality Scalable Vector Graphics (SVG) across a wide spectrum of complexity, from simple icons to intricate anime characters. OmniSVG demonstrates remarkable versatility in generating high-quality SVGs adhering to multimodal instructions, covering tasks like Text-to-SVG, Image-to-SVG, and Character-Reference SVG, making it a powerful and flexible solution for diverse creative tasks.
  • Figure 2: Overview of OmniSVG. OmniSVG is built on a pre-trained vision-language model Qwen2.5-VL and incorporates an SVG tokenizer. The model tokenizes both text and image inputs as prefix tokens, while the SVG tokenizer encodes vector graphics commands into a unified representation space.
  • Figure 3: Qualitative Comparison with SOTA Methods on Text-to-SVG Task. We compare the proposed method with SOTA Text-to-SVG methods on our evaluation benchmarks, namely Icon and Illustration.
  • Figure 4: Qualitative Comparison with SOTA Methods on Image-to-SVG Task. We compare the proposed method with SOTA Image-to-SVG methods on our evaluation benchmarks.
  • Figure 5: Generated SVG with Character-Reference (CRef) by OmniSVG.
  • ...and 7 more figures