Can Large Language Models Understand Symbolic Graphics Programs?

Zeju Qiu; Weiyang Liu; Haiwen Feng; Zhen Liu; Tim Z. Xiao; Katherine M. Collins; Joshua B. Tenenbaum; Adrian Weller; Michael J. Black; Bernhard Schölkopf

TL;DR

The paper introduces SGP-Bench, a scalable benchmark for assessing how well large language models semantically understand symbolic graphics programs written as SVG and CAD representations. Understanding is defined as answering semantic questions about the rendered content given only the symbolic program, which requires visual imagination and long-range program reasoning. The benchmark also tests semantic consistency, that is, whether answers stay stable when the program is perturbed. Empirical results show a scaling trend: larger models perform better, and proprietary models (GPT/Claude) outperform open-source peers, yet semantic understanding remains challenging, especially for SVG. The authors propose Symbolic Instruction Tuning (SIT), which finetunes LLMs on a large, semantically descriptive dataset derived from rendered graphics. SIT yields significant gains in symbolic understanding and notable improvements in general reasoning across a broad set of benchmarks. It thus offers a practical path to improving LLM reasoning over structured symbolic graphics, while highlighting fundamental gaps between human and machine visual reasoning in this domain.

Abstract

Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM's ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to "imagine" and reason about how the corresponding graphics content would look with only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability -- Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves LLMs' understanding of symbolic programs, but also improves general reasoning ability on various other benchmarks.
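To make the idea of a semantics-preserving transformation concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the benchmark's actual pipeline. It translates a toy SVG path, so every coordinate in the program text changes while the drawn shape stays a triangle, and it scores how consistently a set of answers agrees across such perturbed programs. The names `TRIANGLE`, `translate_path`, and `consistency` are hypothetical, and the agreement score is one simple choice, not necessarily the metric used in the paper.

```python
import re
from collections import Counter

# Hypothetical toy program: a triangle drawn with absolute M/L/Z commands only.
TRIANGLE = "M 10 10 L 50 10 L 30 40 Z"

# Matches one "x y" or "x,y" coordinate pair.
_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)[ ,]+(-?\d+(?:\.\d+)?)")


def translate_path(d: str, dx: float, dy: float) -> str:
    """Shift every absolute (x, y) pair in a path string by (dx, dy).

    Assumption: the path uses only absolute M/L/Z commands, so every number
    pair is a coordinate. Relative commands, arcs, and H/V are not handled.
    """
    def shift(match: re.Match) -> str:
        x = float(match.group(1)) + dx
        y = float(match.group(2)) + dy
        return f"{x:g} {y:g}"

    return _PAIR.sub(shift, d)


def consistency(answers: list[str]) -> float:
    """Fraction of answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


# The program text changes, but the rendered shape is still a triangle.
perturbed = translate_path(TRIANGLE, 5, -3)
print(perturbed)

# Hypothetical model answers to "What shape is this?" over four perturbed programs.
print(consistency(["triangle", "triangle", "square", "triangle"]))
```

A model that truly reasons about the rendered content should give the same answer for `TRIANGLE` and `perturbed`, so a low agreement score across perturbed variants suggests reliance on surface features of the program text.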

Paper Structure (38 sections, 26 figures, 7 tables)

Introduction
Semantic Understanding of Symbolic Graphics Programs
Why is Understanding Symbolic Graphics Programs Interesting?
A Benchmark for Symbolic Graphics Program Understanding
Dataset Creation Pipeline
Benchmarking Semantic Understanding
Benchmarking Semantic Consistency
Prediction Entropy of LLMs and Humans
Improving LLMs with Symbolic Instruction Tuning
SIT Can Improve General Reasoning Ability
A Critical View on Current LLMs' Capability
Related Work and Acknowledgment
Appendix
Benchmark Details
Data preparation
...and 23 more sections

Figures (26)

Figure 1: Our benchmark assesses LLMs' understanding of symbolic graphics programs in semantic understanding and prediction consistency. Note that the LLM can only see symbolic graphics programs and the corresponding questions. The rendered images are not input to the LLM.
Figure 2: Illustration of the symbolic graphics program understanding task.
Figure 3: A qualitative example of CAD reasoning.
Figure 4: Qualitative examples of how LLMs reason over the symbolic program and obtain their answers.
Figure 5: OpenAI-o1 still suffers from the spurious correlation from the Ebbinghaus illusion while reasoning over images (a). In contrast, OpenAI-o1 works perfectly fine while reasoning over symbolic graphics programs directly (b) or indirectly (c).
...and 21 more figures
