Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment

Hanlin Wu, Xufeng Duan, Zhenguang Cai

TL;DR

This study probes whether large audio-language models (LALMs) process speaker-contextualized language in a manner parallel to humans by aligning model-derived surprisal and entropy with two human EEG markers, the N400 and the P600. Using Mandarin stimuli and bilingual prompts, two LALMs (Qwen2-Audio and Ultravox 0.5) were evaluated for sensitivity to speaker-content incongruency across social and biological violations. Results show that Qwen2-Audio exhibits increased surprisal for incongruent content and that its surprisal predicts human N400 amplitudes, whereas Ultravox 0.5 shows limited sensitivity. Neither model reproduces the human-like dissociation between social and biological violations. Predictive uncertainty (entropy) in both models largely tracks linguistic properties rather than speaker context and does not strongly map to human responses, although Ultravox 0.5 shows a marginal association with the P600. Overall, the findings reveal both the potential and the limitations of current LALMs in social-linguistic processing and highlight differences between human cognition and forward-predictive architectures. They carry implications for model design and training to support more natural human-AI interaction, and for ethical considerations of bias.

Abstract

Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, asking whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms. We compared two LALMs' (Qwen2-Audio and Ultravox 0.5) processing patterns with human EEG responses. Using surprisal and entropy metrics from the models, we analyzed their sensitivity to speaker-content incongruency across social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant). Results revealed that Qwen2-Audio exhibited increased surprisal for speaker-incongruent content and its surprisal values significantly predicted human N400 responses, while Ultravox 0.5 showed limited sensitivity to speaker characteristics. Importantly, neither model replicated the human-like processing distinction between social violations (eliciting N400 effects) and biological violations (eliciting P600 effects). These findings reveal both the potential and limitations of current LALMs in processing speaker-contextualized language, and suggest differences in social-linguistic processing mechanisms between humans and LALMs.
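
As a rough illustration of the two model-derived metrics named above, the sketch below computes surprisal and entropy from a next-token probability distribution. It uses the standard information-theoretic definitions; the paper's own equations are not reproduced in this summary, so the log base (bits), the function names, and the toy distribution are all assumptions made for illustration, not the authors' implementation.

```python
import math


def surprisal(probs, target):
    """Surprisal in bits of the observed token: -log2 p(target | context)."""
    return -math.log2(probs[target])


def entropy(probs):
    """Shannon entropy in bits of the whole next-token distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical next-token distribution (not taken from the paper's stimuli).
toy = {"haircut": 0.5, "massage": 0.25, "manicure": 0.125, "shave": 0.125}

print(surprisal(toy, "haircut"))   # low surprisal: the likely continuation
print(surprisal(toy, "manicure"))  # higher surprisal: the unlikely continuation
print(entropy(toy))                # uncertainty before the word is observed
```

Surprisal depends on the word that actually occurs, while entropy depends only on the distribution before that word is seen. This is why the two metrics can dissociate: a model may be uncertain in general (high entropy) without being especially surprised by a particular speaker-incongruent word, or the reverse.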

Paper Structure

This paper contains 18 sections, 2 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Surprisal values from Qwen2-Audio and Ultravox 0.5 models for speaker-congruent (blue) and speaker-incongruent (red) utterances, shown separately for social and biological conditions in Chinese and English.
  • Figure 2: Entropy values from Qwen2-Audio and Ultravox 0.5 models for speaker-congruent (blue) and speaker-incongruent (red) utterances, shown separately for social and biological conditions in Chinese and English.
  • Figure 3: Main effect coefficients of Surprisal and Entropy on N400 and P600 amplitudes from LME analyses for the Qwen2-Audio and Ultravox 0.5 models. Dark blue indicates significant effects, light blue marginal effects, and alice blue non-significant effects.