
A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

Tianle Chen, Deepti Ghadiyaram

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that coordinated multi-modal attacks create a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs. $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.


Paper Structure

This paper contains 34 sections, 7 equations, 15 figures, and 5 tables.

Figures (15)

  • Figure 1: Multi-modal typography example. A clean audio-video input depicting a cat leads to the correct prediction cat. We inject distractors -- spoken text (audio typography), on-screen text (visual typography), or a distractor text prompt (turned off in this example). We show that the model prediction shifts toward the injected target (horse), indicating the vulnerability of audio-visual MLLMs.
  • Figure 2: Sensitivity of audio typography to volume, temporal placement, repetition, and voice on MMA-Bench for Qwen2.5-Omni-7B. Each panel shows the injected-target prediction rate for audio and visual questions. Volume has the strongest effect; later placement and higher repetition also strengthen the attack, while voice choice has a comparatively modest impact.
  • Figure 3: Effectiveness--stealth trade-off of audio typography attacks. Audio- and visual-question accuracy are shown against relative RMS and speech-recognition shift. Lower accuracy indicates a stronger attack, while lower values on both stealth axes indicate better stealth. Volume is most effective but least stealthy, whereas repetition offers a better trade-off.
  • Figure 4: Default dataset-specific spoken injection templates. Class-label tasks such as MMA-Bench and Music-AVQA use short wrong-answer statements. The option-based WorldSense benchmark uses an answer-style phrase that names an incorrect option together with its semantic content. The MetaHarm safety evaluation uses benign spoken cues to bias the model toward a harmless judgment.
  • Figure 5: Extended effectiveness--stealth analysis across all stealth metrics. Each panel plots average task accuracy against one normalized stealth cost, with lower-left indicating simultaneously stronger and stealthier attacks. The same family-wise pattern remains visible across metrics: gain traces the strongest but least stealthy regime, repetition provides the best effectiveness--stealth balance, temporal position changes effectiveness with minimal low-level distortion, and voice identity has only a secondary effect. RMS exhibits the strongest monotonic relation with attack strength, while spectral flatness is the weakest, showing that not all low-level metrics are equally diagnostic of semantic audio attacks.
  • ...and 10 more figures
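Figures 3 and 5 measure attack stealth with low-level audio statistics such as relative RMS and spectral flatness. The paper does not spell out its exact formulas in this excerpt, so the snippet below is a minimal sketch of one plausible reading: relative RMS as the energy of the injected residual over the energy of the clean waveform, and spectral flatness as the geometric-over-arithmetic mean of the power spectrum. The function names and the residual-based definition are assumptions, not the authors' implementation.

```python
import numpy as np

def relative_rms(clean: np.ndarray, attacked: np.ndarray) -> float:
    # Assumed definition: RMS of the injected residual (attacked - clean)
    # relative to the RMS of the clean waveform. Lower = stealthier.
    residual = attacked - clean
    rms = lambda x: float(np.sqrt(np.mean(x ** 2)))
    return rms(residual) / rms(clean)

def spectral_flatness(signal: np.ndarray, eps: float = 1e-12) -> float:
    # Standard spectral flatness: geometric mean over arithmetic mean of
    # the power spectrum. Near 1 for noise-like signals, near 0 for tones.
    power = np.abs(np.fft.rfft(signal)) ** 2 + eps
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

# Toy illustration: a 440 Hz tone with a quiet additive perturbation.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440.0 * t)
attacked = clean + 0.1 * rng.standard_normal(t.shape)
print(f"relative RMS: {relative_rms(clean, attacked):.3f}")
print(f"flatness (clean tone): {spectral_flatness(clean):.3f}")
print(f"flatness (attacked):   {spectral_flatness(attacked):.3f}")
```

Under this reading, louder injections (the "gain" family in Figure 5) raise relative RMS directly, which is consistent with RMS being the metric most monotonically tied to attack strength, while spectral flatness barely moves for a speech-like insertion, matching its weaker diagnostic power.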