What does Kiki look like? Cross-modal associations between speech sounds and visual shapes in vision-and-language models

Tessa Verhoef, Kiana Shahrasbi, Tom Kouwenhoven

TL;DR

The paper probes whether four vision-language models (CLIP, BLIP2, ViLT, and GPT-4o) encode bouba-kiki cross-modal associations like those found in humans. By adapting human experimental paradigms with image sets and pseudowords, and analyzing image-to-text probabilities with Bayesian methods, the authors assess whether VLMs align speech-like sounds with curved versus jagged shapes. Results show limited and highly model-dependent evidence for the bouba-kiki effect: CLIP and GPT-4o show some alignment under specific test designs, while BLIP2 and ViLT largely show no robust effect. The findings suggest that cross-modal sound symbolism in VLMs is not guaranteed and depends on architecture, training data, and prompting, with implications for modeling language emergence and human-AI interaction.

Abstract

Humans have clear cross-modal preferences when matching certain novel words to visual shapes. Evidence suggests that these preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings. With the rise of multimodal models in AI, such as vision-and-language models (VLMs), it becomes increasingly important to uncover the kinds of visio-linguistic associations these models encode and whether they align with human representations. Informed by experiments with humans, we probe and compare four VLMs for a well-known human cross-modal preference, the bouba-kiki effect. We do not find conclusive evidence for this effect but suggest that results may depend on features of the models, such as architecture design, model size, and training details. Our findings inform discussions on the origins of the bouba-kiki effect in human cognition and future developments of VLMs that align well with human cross-modal associations.

Paper Structure (22 sections, 10 figures, 1 table)

This paper contains 22 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Which of these two shapes is Kiki? Images from Köhler (1929, 1947).
  • Figure 2: Percentages of trials in which selected syllables contain sonorant consonants or rounded vowels, separated by image shape (Jagged or Curved) for all four VLMs
  • Figure 3: Probability scores for the original pseudowords (bouba, kiki, takete and maluma), as well as for the four different generated syllable types: Sonorant-Rounded (S-R), Sonorant-Non-Rounded (S-NR), Plosive-Rounded (P-R) and Plosive-Non-Rounded (P-NR), paired with two types of shapes (Jagged or Curved) for three VLMs
  • Figure 4: Percentages of trials in which Jagged or Curved visual shapes were matched to Sonorant-Rounded (S-R) syllables embedded in two-syllable pseudowords for all VLMs. Here 0% for S-R syllables implies a 100% preference for P-NR syllables.
  • Figure 5: Probability scores for four pseudoword types, combining Sonorant-Rounded (S-R) and Plosive-Non-Rounded (P-NR) syllables, paired with two types of shapes (Jagged or Curved) for three VLMs
  • ...and 5 more figures
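
The figure captions above refer to four syllable types (S-R, S-NR, P-R, P-NR) and to two-syllable pseudowords built from S-R and P-NR syllables. As a rough illustration of how such stimuli could be labeled and combined, here is a minimal Python sketch. The phoneme inventories, the consonant-vowel syllable shape, and the function names `syllable_type`, `syllables_of_type`, and `pseudoword_types` are assumptions made for this example; they are not taken from the paper, whose actual stimulus set is not given in this section.

```python
from itertools import product

# Assumed phoneme inventories, for illustration only. The paper's actual
# inventories are not listed in this section.
SONORANTS = ("m", "n", "l")
PLOSIVES = ("p", "t", "k")
ROUNDED = ("o", "u")
NON_ROUNDED = ("i", "e")

CONSONANTS = {"S": SONORANTS, "P": PLOSIVES}
VOWELS = {"R": ROUNDED, "NR": NON_ROUNDED}


def syllable_type(syllable: str) -> str:
    """Label a consonant-vowel syllable as S-R, S-NR, P-R or P-NR."""
    if len(syllable) != 2:
        raise ValueError(f"expected a consonant-vowel syllable, got {syllable!r}")
    consonant, vowel = syllable
    for c_label, consonants in CONSONANTS.items():
        if consonant in consonants:
            break
    else:
        raise ValueError(f"consonant {consonant!r} is not in the assumed inventory")
    for v_label, vowels in VOWELS.items():
        if vowel in vowels:
            break
    else:
        raise ValueError(f"vowel {vowel!r} is not in the assumed inventory")
    return f"{c_label}-{v_label}"


def syllables_of_type(label: str) -> list[str]:
    """Enumerate all syllables of one type, e.g. 'S-R' -> ['mo', 'mu', ...]."""
    c_label, v_label = label.split("-", 1)
    return [c + v for c, v in product(CONSONANTS[c_label], VOWELS[v_label])]


def pseudoword_types() -> dict[tuple[str, str], list[str]]:
    """Build two-syllable pseudowords from S-R and P-NR syllables.

    Keys are (first syllable type, second syllable type), giving four
    pseudoword types. Treating the four types as ordered pairs is an
    assumption of this sketch.
    """
    pools = {label: syllables_of_type(label) for label in ("S-R", "P-NR")}
    return {
        (first, second): [a + b for a, b in product(pools[first], pools[second])]
        for first, second in product(pools, repeat=2)
    }


if __name__ == "__main__":
    print(syllable_type("mu"), syllable_type("ki"))
    for key, words in pseudoword_types().items():
        print(key, len(words), words[:3])
```

Keeping the inventories in two small dictionaries means a different consonant or vowel set can be swapped in without changing the labeling logic.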