Can Large Language Models Understand Symbolic Graphics Programs?
Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
TL;DR
The paper introduces SGP-Bench, a scalable benchmark for assessing large language models' semantic understanding of symbolic graphics programs expressed in SVG and CAD formats. Understanding is defined as the ability to answer semantic questions about the rendered content given only the symbolic program, which requires visual imagination and long-range program reasoning; the benchmark also tests semantic consistency under perturbations. Empirical results show a scaling trend in which larger models perform better and proprietary models (GPT, Claude) outperform open-source peers, yet semantic understanding remains challenging, especially for SVG. The authors propose Symbolic Instruction Tuning (SIT), which finetunes LLMs on a large, semantically descriptive dataset derived from rendered graphics, yielding significant gains in symbolic understanding and notable improvements in general reasoning across a broad set of benchmarks. SIT thus offers a practical path toward improving LLM reasoning over structured symbolic graphics, while highlighting fundamental gaps between human and machine visual reasoning in this domain.
Abstract
Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks that the models have not encountered during training. We propose symbolic graphics programs as a domain well suited to testing multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM's ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to "imagine" and reason about how the corresponding graphics content would look, given only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability: Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves LLMs' understanding of symbolic programs, but also improves general reasoning ability on various other benchmarks.
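To make the idea of a semantics-preserving transformation concrete, the following is a minimal sketch and not the authors' pipeline. It assumes rotation as one illustrative perturbation (the abstract does not name the specific transformations), and the helper names `make_svg`, `rotate`, `side_lengths`, and `consistency` are hypothetical. It shows a triangle whose SVG program text changes under rotation while its shape does not, plus a simple consistency score over answer pairs.

```python
import math

def make_svg(points):
    """Serialize (x, y) vertices as a minimal SVG polygon program (hypothetical helper)."""
    pts = " ".join(f"{x:.2f},{y:.2f}" for x, y in points)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
        f'<polygon points="{pts}"/></svg>'
    )

def rotate(points, degrees, center=(50.0, 50.0)):
    """Rotate vertices about a center; the coordinates change, the shape does not."""
    t = math.radians(degrees)
    cx, cy = center
    return [
        (cx + (x - cx) * math.cos(t) - (y - cy) * math.sin(t),
         cy + (x - cx) * math.sin(t) + (y - cy) * math.cos(t))
        for x, y in points
    ]

def side_lengths(points):
    """Edge lengths of the closed polygon, a rotation-invariant shape descriptor."""
    n = len(points)
    return [math.dist(points[i], points[(i + 1) % n]) for i in range(n)]

def consistency(answers_original, answers_perturbed):
    """Fraction of questions answered identically before and after perturbation."""
    pairs = list(zip(answers_original, answers_perturbed))
    return sum(a == b for a, b in pairs) / len(pairs)

# Assumed example shape: an isosceles triangle inside a 100x100 canvas.
triangle = [(50.0, 20.0), (80.0, 80.0), (20.0, 80.0)]
rotated = rotate(triangle, 30)

original_program = make_svg(triangle)    # what the LLM would read
perturbed_program = make_svg(rotated)    # same shape, different numbers
```

A model that truly reasons about the rendered shape should give the same answer to "How many sides does the object have?" for both programs; `consistency` would then score how often that holds across a question set.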
