ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

Kerry Luo, Michael Fu, Joshua Peguero, Husnain Malik, Anvay Patil, Joyce Lin, Megan Van Overborg, Ryan Sarmiento, Kevin Zhu

TL;DR

Large language models exhibit limited spatial reasoning for visually structured text. ASCIIBench provides a curated ASCII-art benchmark (5,315 pieces, 752 classes) and a fine-tuned CLIP variant to assess classification and generation of ASCII art. Results show that CLIP-based separation is weak for most categories, pointing to a representation bottleneck; only classes with high internal similarity are clearly separable. The work proposes structure-aware embeddings and standardized rendering/prompts to advance multimodal understanding of symbolic visuals.

Abstract

Large language models (LLMs) have demonstrated several emergent behaviors with scale, including reasoning and fluency in long-form text generation. However, they continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art, a symbolic medium where characters encode structure and form, provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images. ASCIIBench consists of a filtered dataset of 5,315 class-labeled ASCII images and is, to our knowledge, the first publicly available benchmark of its kind. Alongside the dataset, we release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories, yielding chance-level performance even for low-variance classes. In contrast, classes with high internal mean similarity exhibit clear discriminability, revealing that the bottleneck lies in representation rather than generational variance. These findings position ASCII art as a stress test for multimodal representations and motivate the development of new embedding methods or evaluation metrics tailored to symbolic visual modalities. All resources are available at https://github.com/ASCIIBench/ASCIIBench.
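The analysis described in the abstract compares cosine similarities of same-class (intra-class) and different-class (inter-class) embedding pairs. The following is a minimal sketch of that kind of comparison, not the authors' implementation: the embeddings are toy 2-D vectors standing in for CLIP outputs, and the function names (`cosine_similarity_matrix`, `split_pair_similarities`, `pairwise_auc`) are hypothetical. The AUC-style score is one assumed way to quantify separation, where 0.5 corresponds to chance level.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the matrix of cosine similarities between all rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def split_pair_similarities(embeddings, labels):
    """Split the similarities of all unordered pairs into intra- and inter-class."""
    labels = np.asarray(labels)
    sims = cosine_similarity_matrix(np.asarray(embeddings, dtype=float))
    i, j = np.triu_indices(len(labels), k=1)  # each unordered pair once
    same_class = labels[i] == labels[j]
    pair_sims = sims[i, j]
    return pair_sims[same_class], pair_sims[~same_class]


def pairwise_auc(intra, inter):
    """Probability that a random intra-class pair outscores a random
    inter-class pair (ties count half); 0.5 means chance-level separation."""
    diff = intra[:, None] - inter[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Toy stand-ins for image embeddings (assumption: real inputs would come
# from a CLIP image encoder applied to rendered ASCII art).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["cat", "cat", "dog", "dog"]

intra, inter = split_pair_similarities(toy_embeddings, toy_labels)
auc = pairwise_auc(intra, inter)
print(f"intra mean={intra.mean():.3f}, inter mean={inter.mean():.3f}, AUC={auc:.2f}")
```

In this toy case the two classes are well separated, so every intra-class pair scores higher than every inter-class pair. Under the abstract's findings, most ASCII categories would instead show heavily overlapping intra- and inter-class distributions.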

Paper Structure

This paper contains 38 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Example classification prompt with result
  • Figure 2: Cosine similarity distributions. Green indicates positive (intra-class) pairs, red indicates negative (inter-class) pairs.
  • Figure 3: Top 30 Class Histogram
  • Figure 4: Class Distribution Pie Chart
  • Figure 5: t-SNE visualization of class embeddings
  • ...and 2 more figures