Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu

TL;DR

This study investigates whether Multimodal Large Language Models exhibit human-like sound-symbolic associations by probing phonosemantic intuition across text, IPA, and audio inputs. The authors build LEX-ICON, a large multilingual mimetic-word dataset with natural and constructed words annotated along 25 semantic dimensions, and evaluate MLLMs on semantic-dimension prediction. They perform comprehensive analyses, including macro-F1-based predictions and layer-wise phoneme attention, revealing modality-specific strengths and gaps relative to human data. The work provides a quantitative, cross-linguistic framework linking cognitive linguistics and AI interpretability, and introduces a new avenue for analyzing how form and meaning cohere in multimodal language models.

Abstract

Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.
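
As a rough illustration of the macro-F1 metric used for semantic-dimension prediction, the sketch below scores a binary A/B choice for one dimension (e.g., sharp vs. round). The labels `"A"` and `"B"`, the function name `macro_f1`, and the toy predictions are assumptions made for illustration; they are not taken from the paper or its dataset.

```python
def macro_f1(gold, pred, labels=("A", "B")):
    """Unweighted mean of per-class F1 scores over the given labels."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never gold and never predicted contributes 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: A = "sharp", B = "round" for four words.
gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Because macro-F1 averages per-class F1 without weighting by class frequency, a model that always answers one side of a dimension is penalized even when that side dominates the data.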

Paper Structure

This paper contains 57 sections, 4 equations, 10 figures, 27 tables.

Figures (10)

  • Figure 1: Phonetic iconicity investigation for MLLMs using natural and constructed mimetic words from text and audio modalities in LEX-ICON. We conduct quantitative evaluations for up to 25 semantic dimensions and examine layer-wise attention fraction scores to identify how phonemes and meanings are related within the models.
  • Figure 2: A comprehensive figure for the data construction flow of LEX-ICON. (1) We manually collect 8,052 mimetic words and definitions from dictionaries, and systematically construct 2,930 disyllabic pseudo-words. (2) Using four LLMs (GPT-4.1, Qwen3-32B, Gemma-3-27B, and Gemini-2.5-flash), we automatically annotate each word with semantic dimensions based on its definitions. (3) For natural words, we retain features agreed upon by all models. For constructed words, we filter out features that are close to neutral. (4) The final dataset contains 10,982 words with 84,932 semantic features with varied input types.
  • Figure 3: Macro-F1 score results for the semantic dimension A/B test. "Natural" and "Constructed" are results of LLM experiments, calculated by averaging all three input types (original text, IPA, and audio). Each dot represents one model's score for a given dimension. Human evaluation covers only the "Audio" input type on sampled data for experimental feasibility; human scores nonetheless exceed the baseline, which supports LEX-ICON's reliability.
  • Figure 4: Pearson correlation scores with human evaluation results by word group and input type. Higher scores reflect greater similarity to humans' semantic dimension score distributions, where Qwen2.5-Omni-7B scores the highest correlation (maximum $r = 0.579$). In all models, constructed words elicit responses that are closer to human tendencies than natural words.
  • Figure 5: Advantage scores (macro-F1 differences) of audio inputs over the original text inputs by word group. The x-axis shows audio advantage scores for natural words, and the y-axis shows those for constructed words. Each dot represents one semantic dimension. The pattern is consistent with linguistic implications, and the two word groups correlate overall (Pearson $r = 0.681$, Spearman $\rho = 0.705$).
  • ...and 5 more figures
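
The advantage scores in Figure 5 can be sketched as a per-dimension difference in macro-F1 between audio and original-text inputs, followed by a Pearson correlation between the natural-word and constructed-word advantages. The example below is a minimal sketch under those assumptions: every number is made up for illustration, the dimension names other than sharp vs. round are hypothetical, and the helper names `advantage` and `pearson` are not from the paper.

```python
import math


def advantage(audio_f1, text_f1):
    """Per-dimension macro-F1 difference: audio minus original text."""
    return {dim: audio_f1[dim] - text_f1[dim] for dim in audio_f1}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Made-up macro-F1 values per semantic dimension (not results from the paper).
natural_text = {"sharp-round": 0.50, "big-small": 0.55, "fast-slow": 0.60}
natural_audio = {"sharp-round": 0.60, "big-small": 0.55, "fast-slow": 0.55}
constructed_text = {"sharp-round": 0.50, "big-small": 0.50, "fast-slow": 0.50}
constructed_audio = {"sharp-round": 0.70, "big-small": 0.50, "fast-slow": 0.40}

adv_natural = advantage(natural_audio, natural_text)
adv_constructed = advantage(constructed_audio, constructed_text)

# One point per dimension: x = natural advantage, y = constructed advantage.
dims = sorted(adv_natural)
r = pearson([adv_natural[d] for d in dims], [adv_constructed[d] for d in dims])
print(f"Pearson r across dimensions = {r:.3f}")
```

A positive advantage means audio input helped on that dimension; a high correlation across dimensions means the dimensions that benefit from audio are largely the same for natural and constructed words.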