Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models

Zhaochen Wang, Bryan Hooi, Yiwei Wang, Ming-Hsuan Yang, Zi Huang, Yujun Cai

TL;DR

The paper exposes a fundamental text-priority bias in vision-language models confronted with adversarial ASCII art: textual semantics often override visual structure. Using a dataset of 700 ASCII-art images built from 100 negative words across seven character types, the authors evaluate five state-of-the-art models and show that semantic content dominates recognition and that visual recognition degrades as semantic complexity increases. Mitigations based on visual parameter tuning and prompting yield only modest improvements, suggesting that architectural changes are needed. The findings have practical implications for content moderation and for safety against adversarial multimodal inputs, and they point future research toward deeper architectural alignment in multimodal systems.

Abstract

Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals across modalities remains underexplored. This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts. We introduce a novel evaluation framework that systematically challenges five state-of-the-art models (including GPT-4o, Claude, and Gemini) using adversarial ASCII art, where character-level semantics deliberately contradict global visual patterns. Our experiments reveal a strong text-priority bias: VLMs consistently prioritize textual information over visual patterns, with visual recognition ability declining dramatically as semantic complexity increases. Various mitigation attempts through visual parameter tuning and prompt engineering yielded only modest improvements, suggesting that this limitation requires architectural-level solutions. These findings uncover fundamental flaws in how current VLMs integrate multimodal information, providing important guidance for future model development while highlighting significant implications for content moderation systems vulnerable to adversarial examples.
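
To make the idea of a semantic-visual conflict concrete, the following is a minimal sketch of how one adversarial sample might be built: a word is drawn as block letters, and each stroke is filled with the characters of a different, contradicting word. This is an illustration only, not the authors' pipeline. The 5x5 bitmap font, the function name `render_ascii_art`, and the filler word "good" are all assumptions introduced here.

```python
from itertools import cycle

# Hypothetical 5x5 bitmap font covering only the letters used in this demo.
# '#' marks a stroke cell, '.' marks background.
FONT = {
    "B": ["####.", "#...#", "####.", "#...#", "####."],
    "A": [".###.", "#...#", "#####", "#...#", "#...#"],
    "D": ["####.", "#...#", "#...#", "#...#", "####."],
}


def render_ascii_art(word: str, filler: str) -> str:
    """Draw `word` as block letters whose strokes are spelled with `filler`.

    The global shape carries one meaning (`word`) while the individual
    characters carry another (`filler`), which is the conflict described above.
    """
    chars = cycle(filler)  # repeat the filler characters indefinitely
    rows = []
    for r in range(5):
        cells = []
        for letter in word.upper():
            for cell in FONT[letter][r]:
                cells.append(next(chars) if cell == "#" else " ")
            cells.append(" ")  # one-column gap between letters
        rows.append("".join(cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    # Visual pattern says "BAD"; the characters spell "good" over and over.
    print(render_ascii_art("BAD", "good"))
```

A model with a text-priority bias would be expected to report the filler ("good") rather than the shape ("BAD"). Swapping the filler for different kinds of characters is one way to imagine the seven character types mentioned in the TL;DR, with 100 words times 7 types giving the 700 images.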

Paper Structure

This paper contains 26 sections, 22 figures, 3 tables.

Figures (22)

  • Figure 1: An example of adversarial ASCII art: the VLM can recognize text at the detail level but fails to perceive the macro-visual structure (the 'BAD' ASCII art).
  • Figure 2: An overview of the process for constructing the ASCII art dataset and two experiments (character-type influence and visual robustness assessment).
  • Figure 3: An example of a prompt and model response in sentiment analysis.
  • Figure 4: The distribution of sentiment predictions across VLMs. All models show a consistent negative–neutral–positive pattern, with sentiment predictions closely aligned with the emotional polarity of the characters: L1 (meaningless) leads to negative judgments, L3 and L4 (neutral) to neutral judgments, and L5 and L7 (positive) to positive judgments.
  • Figure 5: GPT-4o's graphical recognition accuracy improves significantly in L3 and L4 when given visual cues, whereas L7 shows minimal enhancement under similar conditions.
  • ...and 17 more figures