VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation

Bocheng Zou, Mu Cai, Jianrui Zhang, Yong Jae Lee

TL;DR

VGBench introduces the first comprehensive benchmark for evaluating large language models on vector graphics understanding and generation across the SVG, TikZ, and Graphviz formats. It combines VGQA (understanding) and VGen (generation) tasks, backed by 4279 QA pairs and 5845 VG-caption pairs, and evaluates diverse LLMs (including GPT-4) under several prompting techniques, comparing them with a multimodal baseline (LLaVA) on rasterized versions of the graphics. Key findings: LLMs perform well on high-level VG formats (TikZ and Graphviz) and show strong generation capability, while the low-level SVG format remains challenging without advanced prompting; Chain-of-Thought and In-Context Learning provide notable gains for low-level formats. The work releases its data and evaluation pipeline to foster future research in vector graphics understanding and generation, and offers insights into format-induced semantic gaps and prompting strategies for LLMs.

Abstract

In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or the only way to represent visual content, especially for designers and artists who depict the world using geometric primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons, sketches, and scientific figures. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) a wide range of prompting techniques, (e) multiple LLMs, and (f) comparison with VLMs on rasterized representations. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced at https://vgbench.github.io.

Paper Structure (42 sections, 9 figures, 9 tables)

This paper contains 42 sections, 9 figures, 9 tables.

Figures (9)

  • Figure 1: VGBench is the first comprehensive vector graphics (VG) understanding and generation benchmark across diverse vector graphics types, question types, and prompting techniques on a rich set of SoTA LLMs. Our large-scale benchmark consists of 4279 multiple-choice question-answer pairs and 5845 VG-caption pairs.
  • Figure 2: Examples of the vector graphics QAs for diverse formats including SVG, TikZ, and Graphviz in VGQA.
  • Figure 3: The semi-automatic curation pipeline in VGQA. Vector graphics are converted into PNG format, then GPT-4V is used to generate question-answer (QA) candidates. Finally, human annotators filter the QA pairs to obtain the high-quality QA dataset.
  • Figure 4: Word distribution based on question categories for each vector graphic type. The top 20 words are sampled from the answers to each type of question. Words with a frequency of less than 4% are represented as "OTHERS".
  • Figure 5: The automatic generation pipeline in VGen. Vector graphics collected from the Internet are first rendered into ground-truth images and then captioned by GPT-4V. Each caption is fed into the target LLM to generate a new vector graphic, which is compared with the caption using CLIP Score and FID to obtain a similarity score. This score is then compared with the similarity score between the ground truth and the same caption, which serves as the upper bound.
  • ...and 4 more figures
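
The comparison step described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that the caption, the rendered generated graphic, and the ground-truth image have already been embedded into a shared space (for example by a CLIP model), and it uses plain cosine similarity as a stand-in for CLIP Score. The function names, the toy vectors, and the "gap" value are hypothetical and chosen only for illustration; FID is not covered here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def compare_to_upper_bound(caption_emb, generated_emb, ground_truth_emb):
    """Score a generated graphic against its caption and against the upper bound.

    Assumption: all three inputs are precomputed embeddings in one shared space.
    The upper bound is the similarity between the ground-truth image and the
    same caption, as in the Figure 5 caption.
    """
    generated_score = cosine_similarity(caption_emb, generated_emb)
    upper_bound = cosine_similarity(caption_emb, ground_truth_emb)
    return {
        "generated": generated_score,
        "upper_bound": upper_bound,
        # Hypothetical summary value: how far the generation falls short.
        "gap": upper_bound - generated_score,
    }


# Toy 2-D embeddings; real embeddings would come from an image-text encoder.
result = compare_to_upper_bound(
    caption_emb=[1.0, 0.0],
    generated_emb=[1.0, 1.0],
    ground_truth_emb=[1.0, 0.0],
)
print(result)
```

In this toy case the ground-truth embedding coincides with the caption embedding, so the upper bound is 1.0 and the generated graphic falls short of it. With real embeddings the upper bound is typically below 1, which is why a generated score is easier to interpret next to it than on its own.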