Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLMs

Yiren Zheng, Shibo Li, Jiaming Liu, Haofan Wang, Yiren Song

Abstract

Current multimodal approaches predominantly treat visual generation as an external process, relying on pixel rendering or code execution, thereby overlooking the native visual representation capabilities latent within Large Language Models (LLMs). In this work, we unlock this potential through ASCII art, a compact, efficient, and text-native visual format. We introduce SVE-ASCII, a unified framework designed to elicit and benchmark Symbolic Visual Expression directly within the pure text space. To address the scarcity of systematic resources, we construct ASCIIArt-7K, a high-quality dataset synthesized via a novel "Seed-and-Evolve" pipeline that augments human-curated anchors through in-context stylistic editing. We further implement a unified instruction-tuning strategy that jointly optimizes for both Generation (Text-to-ASCII) and Understanding (ASCII-to-Text). Crucially, our experiments reveal a critical phenomenon regarding task duality: while it is established that perception aids generation, we provide compelling evidence that generative training significantly enhances visual comprehension. This confirms a mutually reinforcing cycle in symbolic visual processing, a relationship previously hypothesized but rarely empirically demonstrated in the visual domain. We release our dataset, the ASCIIArt-Bench benchmark, and the SVE-ASCII model, establishing a robust baseline for native text-based visual intelligence.
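
The joint Generation/Understanding objective described above can be pictured as turning each (description, ASCII art) pair into two instruction-tuning samples, one per direction. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the `Sample` class, the `make_joint_samples` helper, the prompt wording, and the toy cup drawing are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    task: str    # "generation" or "understanding"
    prompt: str  # instruction shown to the model
    target: str  # text the model is supervised to produce


def make_joint_samples(caption: str, ascii_art: str) -> List[Sample]:
    """Build one sample per direction from a single (caption, art) pair.

    The prompt templates are illustrative assumptions, not the paper's.
    """
    generation = Sample(
        task="generation",  # Text-to-ASCII: supervise on the drawing
        prompt=f"Draw ASCII art of: {caption}",
        target=ascii_art,
    )
    understanding = Sample(
        task="understanding",  # ASCII-to-Text: supervise on the caption
        prompt=f"Describe this ASCII art:\n{ascii_art}",
        target=caption,
    )
    return [generation, understanding]


# Toy drawing, invented for this example only.
cup = "\n".join([
    "  ( (   ",
    "   ) )  ",
    " ._____.",
    " |     |]",
    " \\     /",
    "  `---' ",
])

samples = make_joint_samples("a cup of coffee", cup)
for s in samples:
    print(s.task, "| prompt starts with:", s.prompt.splitlines()[0])
```

Mixing both sample types in one training set is one simple way to realize "switching supervision signals between tasks": the same pair supplies the target in one direction and the input in the other.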

Paper Structure (24 sections, 5 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Capabilities of SVE-ASCII. Our unified framework achieves high-quality ASCII art generation (a) while accurately interpreting the underlying shapes and semantics (b). Quantitative results in (c) demonstrate that our model significantly outperforms state-of-the-art baselines, and (d) highlights its versatility in synthesizing diverse ASCII art across varying categories and scales.
  • Figure 2: Overview of SVE-ASCII. First, we construct the ASCIIArt-7K dataset using a scalable synthesis pipeline. Subsequently, we fine-tune Qwen2.5-7B-Instruct via Understanding-Generation Joint Training. By switching supervision signals between tasks, our model achieves efficient bidirectional capabilities in both understanding and generating ASCII art.
  • Figure 3: Generation from Scratch vs. Imitation. The quality of the generated ASCII art cup improves significantly when an imitation example is provided in the prompt.
  • Figure 4: Impact of Topology on Style Transfer. We synthesize variants for different seeds. Visually insensitive subjects (e.g., trucks) demonstrate robust style transfer across diverse shapes. Conversely, visually sensitive subjects (e.g., rabbits) struggle with large morphological changes, often failing to transform into distinct animals such as foxes or cats.
  • Figure 5: Qualitative comparison with SOTA methods on the Generation Task. We compare the proposed method with SOTA baselines on our evaluation benchmarks, presenting results from the Recall Subset on the left and the Generalization Subset on the right.
  • ...and 1 more figure