Kiki or Bouba? Sound Symbolism in Vision-and-Language Models

Morris Alper, Hadar Averbuch-Elor

TL;DR

The paper demonstrates that vision-language models encode non-arbitrary sound-symbolic mappings, akin to the kiki-bouba effect, by constructing pseudowords and probing CLIP and Stable Diffusion in a zero-shot setting. Using prompts and a multimodal embedding framework, the authors define geometric and phonetic scores to quantify associations between written sounds and visual properties, and report strong discriminative metrics along with agreement with human perception. The work provides a computational approach to sound symbolism, reveals emergent surface-form knowledge in multimodal encoders, and discusses implications for cognitive science and interpretability, while acknowledging dataset- and language-specific caveats and proposing avenues for multilingual exploration.

Abstract

Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well-demonstrated with regard to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.
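
To make the idea of zero-shot probing more concrete, the following is a minimal sketch of how an association score for a pseudoword could be computed from embeddings. It is not the authors' implementation, and the exact definitions of the geometric and phonetic scores are not given in this summary. The prompt template is the one quoted in the Figure 1 caption; the function names (`build_prompt`, `association_score`) and the choice of projecting onto the difference of mean probe embeddings are assumptions made for illustration. In practice the embeddings would come from a model such as CLIP; here they are plain lists of floats so that the sketch stays self-contained.

```python
import math

# Prompt template quoted in the Figure 1 caption.
PROMPT_TEMPLATE = "a 3D rendering of a {w} shaped object"


def build_prompt(pseudoword):
    """Insert a pseudoword (e.g. 'kiki') into the prompt template."""
    return PROMPT_TEMPLATE.format(w=pseudoword)


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def association_score(embedding, sharp_embeddings, round_embeddings):
    """Hypothetical score: cosine similarity between an embedding and the
    direction pointing from the mean 'round' probe embedding to the mean
    'sharp' probe embedding. Positive values lean sharp, negative lean round.
    This is an illustrative assumption, not the paper's stated formula."""
    sharp_mean = mean_vector(sharp_embeddings)
    round_mean = mean_vector(round_embeddings)
    direction = [s - r for s, r in zip(sharp_mean, round_mean)]
    return cosine(embedding, direction)
```

To use the sketch, one would embed `build_prompt("kiki")` and a set of probe prompts built from sharp-related and round-related adjectives with the same encoder, then pass the resulting vectors to `association_score`. Which adjectives serve as probes is left open here, since this summary does not list them.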

Paper Structure (32 sections, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Illustration of the kiki--bouba effect. The shapes on the far left illustrate stimuli used in the classic kiki--bouba experiment. The remaining images are random generations from Stable Diffusion with the prompt "a 3D rendering of a $\left<w\right>$ shaped object", where $\left<w\right> \in \{\text{kiki}, \text{bouba}\}$. Which of these images do you think were generated using the pseudoword kiki and which using bouba? See below for the answer.
  • Figure 2: Graphemes sorted by average geometric score $\gamma_{\left<w\right>}$ for pseudowords $\left<w\right>$ whose first syllable contains the given grapheme, calculated with Stable Diffusion and CLIP. Characters are colored based on their ground-truth association (red for $\largestar$, blue for $\bigcircle$). Consonants are shown above and vowels below the arrow. We see that the two classes are mostly well-discriminated by these scores, especially when calculated with Stable Diffusion. In this visualization, consonants and vowels are displayed on separate scales and are not positioned absolutely with respect to each other.
  • Figure 3: Ground-truth adjectives sorted by phonetic score $\phi_{\left<w\right>}$, calculated with Stable Diffusion and CLIP. Adjectives are colored based on their ground-truth association (red for $\largestar$, blue for $\bigcircle$). We see that the two classes are highly differentiated by phonetic score for both models, as further reflected in the corresponding metrics in Table \ref{tab:association_metrics}.
  • Figure 4: Image generations for pseudowords with high (top 20%) and low (bottom 20%) geometric scores. We visualize random selections of pseudoword--image pairs for each category. Pseudowords whose class ($\largestar$ or $\bigcircle$) does not match their geometric score are indicated in red. As seen above, the shapes of the generated images noticeably correlate with the pseudoword class.
  • Figure 5: Images generated from pseudowords reminiscent of real English words. For each pseudoword we display an associated image generation and the automatically detected closest English word. Pseudowords with high or low geometric scores (in top or bottom 20% relative to all pseudowords) which do not match their class ($\largestar$ or $\bigcircle$) are indicated in red.
  • ...and 2 more figures