Kiki or Bouba? Sound Symbolism in Vision-and-Language Models

Morris Alper; Hadar Averbuch-Elor

TL;DR

The paper demonstrates that vision-and-language models encode non-arbitrary sound-symbolic mappings, akin to the kiki-bouba effect, by constructing pseudowords and probing CLIP and Stable Diffusion in a zero-shot setting. Using prompts and a multimodal embedding framework, the authors define geometric and phonetic scores that quantify associations between written sounds and visual properties, and they report strong discriminative metrics along with agreement with human perception in a user study. The work provides a computational approach to sound symbolism, reveals emergent surface-form knowledge in multimodal encoders, and discusses implications for cognitive science and interpretability, while acknowledging dataset- and language-specific caveats and proposing multilingual exploration as future work.

Abstract

Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well-demonstrated with regard to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.
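
To make the idea of a geometric score concrete, the following is a minimal sketch, not the authors' exact definition. It assumes that a pseudoword prompt and a set of "sharp" and "round" adjective prompts have already been embedded in a shared space, and it scores the pseudoword by its cosine similarity to the direction pointing from the round mean to the sharp mean. The function names and the two-dimensional toy vectors are hypothetical; in practice the embeddings would come from a model such as CLIP.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def geometric_score(word_emb, sharp_embs, round_embs):
    """Hypothetical score: positive leans 'sharp', negative leans 'round'.

    Assumption: the sharp-vs-round contrast is approximated by the
    difference between the mean sharp and mean round adjective embeddings.
    """
    sharp_mean = mean_vector(sharp_embs)
    round_mean = mean_vector(round_embs)
    direction = [s - r for s, r in zip(sharp_mean, round_mean)]
    return cosine(word_emb, direction)


# Toy embeddings for illustration only (not produced by any real model).
sharp_embs = [[1.0, 0.1], [0.9, 0.0]]
round_embs = [[0.0, 1.0], [0.1, 0.9]]
kiki_like = [0.8, 0.2]
bouba_like = [0.2, 0.8]

kiki_score = geometric_score(kiki_like, sharp_embs, round_embs)
bouba_score = geometric_score(bouba_like, sharp_embs, round_embs)
```

With these toy vectors, the kiki-like embedding receives a positive score and the bouba-like embedding a negative one, which mirrors the kind of separation the paper reports for real model embeddings.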

Paper Structure (32 sections, 7 figures, 9 tables)

Introduction
Related Work
Computational Paradigm for Sound Symbolic Probing
Pseudoword Construction
Zero-shot Knowledge Probing
Evaluation Method
Results and Evaluation
Experimental Details
Quantitative Evaluation
User Study
Qualitative Results
Discussion, Limitations and Future Work
Background on Phonetics and Sound Symbolism
Experimental Details
Image Generation Settings
...and 17 more sections

Figures (7)

Figure 1: Illustration of the kiki--bouba effect. The shapes on the far left illustrate stimuli used in the classic kiki--bouba experiment. The remaining images are random generations from Stable Diffusion with the prompt "a 3D rendering of a $\left<w\right>$ shaped object", where $\left<w\right> \in$ {kiki, bouba}. Which of these images do you think were generated using the pseudoword kiki and which using bouba? See below for the answer.
Figure 2: Graphemes sorted by average geometric score $\gamma_{\left<w\right>}$ for pseudowords $\left<w\right>$ whose first syllable contains the given grapheme, calculated with Stable Diffusion and CLIP. Characters are colored based on their ground-truth association (red for $\largestar$, blue for $\bigcircle$). Consonants are shown above and vowels below the arrow. We see that the two classes are mostly well-discriminated by these scores, especially when calculated with Stable Diffusion. In this visualization, consonants and vowels are displayed on separate scales and are not positioned absolutely with respect to each other.
Figure 3: Ground-truth adjectives sorted by phonetic score $\phi_{\left<w\right>}$, calculated with Stable Diffusion and CLIP. Adjectives are colored based on their ground-truth association (red for $\largestar$, blue for $\bigcircle$). We see that the two classes are highly differentiated by phonetic score for both models, as further reflected in the corresponding metrics in Table \ref{tab:association_metrics}.
Figure 4: Image generations for pseudowords with high (top 20%) and low (bottom 20%) geometric scores. We visualize random selections of pseudoword--image pairs for each category. Pseudowords whose class ($\largestar$ or $\bigcircle$) does not match their geometric score are indicated in red. As seen above, the shapes of the generated images noticeably correlate with the pseudoword class.
Figure 5: Images generated from pseudowords reminiscent of real English words. For each pseudoword we display an associated image generation and the automatically detected closest English word. Pseudowords with high or low geometric scores (in the top or bottom 20% relative to all pseudowords) that do not match their class ($\largestar$ or $\bigcircle$) are indicated in red.
...and 2 more figures
