Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect

Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef

TL;DR

This work re-evaluates the bouba-kiki cross-modal effect in two CLIP backbones (ResNet-50 and ViT) using human-aligned prompts and a Grad-CAM interpretability approach. By pairing carefully controlled curved vs jagged shapes with diverse pseudowords and adjectives, the authors show that neither backbone consistently exhibits human-like shape–word associations, especially for novel pseudowords. Through Bayesian regression analyses and Grad-CAM, they demonstrate that model predictions largely reflect chance and that attention does not target the expected shape features, highlighting a gap between VLM representations and human cognition. The findings emphasize limitations in current vision–language grounding and invite future work on embodied, cross-linguistic, and interpretable grounding to achieve more human-like cross-modal understanding.

Abstract

Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like 'bouba' with round shapes and 'kiki' with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and Grad-CAM, used here as a novel approach to interpreting visual attention in shape-word matching tasks. Our findings show that these model variants do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both model variants lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
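The prompt-based evaluation mentioned in the abstract can be pictured with a minimal sketch. It assumes the model has already produced one image-text score per candidate label; the function names, the label set, and the scores below are hypothetical and are not taken from the paper. The scores are turned into probabilities with a softmax, and the label with the highest probability is treated as the model's preference.

```python
import math


def label_probabilities(scores):
    """Softmax over per-label image-text scores (numerically stable)."""
    peak = max(scores.values())
    exps = {label: math.exp(value - peak) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def preferred_label(scores):
    """Return the label with the highest probability."""
    probs = label_probabilities(scores)
    return max(probs, key=probs.get)


# Hypothetical scores for a single curved image (illustrative values only).
curved_image_scores = {"bouba": 21.3, "kiki": 20.1, "maluma": 20.8, "takete": 19.5}

probs = label_probabilities(curved_image_scores)
choice = preferred_label(curved_image_scores)
print(choice, round(probs[choice], 3))
```

In this sketch a response would count as congruent when the preferred label belongs to the word category expected for the image's shape; how the paper aggregates such responses across prompts is described by the regression models in the figure captions, not by this code.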

Paper Structure

This paper contains 23 sections, 10 figures, 5 tables.

Figures (10)

  • Figure 1: An overview of the two complementary methods used. On the left, we calculate the probabilities for each label across the four original pseudowords (note that the number of labels varies per label source) for each image shape and select the label with the highest probability (values are exemplary). On the right, we use concatenated image pairs and their labels as targets to calculate attention patterns with Grad-CAM and select the shape with the highest sum of attention.
  • Figure 2: The proportion of congruent responses for matching both images of an image pair correctly ($Match = 1$). A result in which models consistently match images above chance (the grey dashed line) across prompts would suggest the presence of cross-modal associations. This is only the case for the English adjectives, which function as a baseline. Model: $Match \sim 1 + Word\_type + (1 + Word\_type | Prompt)$. Diamonds are descriptive means, and the dots are posterior means.
  • Figure 3: The proportion of correctly matched responses for both labels of a label pair ($Correct = 1$) given an image pair using Grad-CAM. The green line indicates human 'performance' reported in Ćwiek et al. (2022). The grey line shows the chance level (25%), as the model must map both 'bouba' and 'kiki' correctly. Model: $Correct \sim 1 + LabelPair + (1 + LabelPair | Prompt)$. Diamonds are descriptive means, and the dots are posterior means.
  • Figure 4: The proportion of correct matches (CorrectProportion) given a word type and category (i.e., curved or sharp). A cross-modal association is indicated when a model consistently matches images above chance (grey line) for both categories—observed only with ViT and the English adjectives used for comparison. Model: $CorrectProportion \sim 1 + Word\_type + Category + (1 + Word\_type + Category | Prompt)$. The diamonds are descriptive averages, and the dots are posterior means.
  • Figure 5: t-SNE plot showing how the language models of different CLIP variants interpret labels from different categories. The colour shades indicate which word type a label in a category belongs to. In order to correctly match labels to images with shape-specific features, a model must be able to discriminate word types between labels of the same category. This is clearly possible. This plot shows the embeddings for the prompt "The label for this image is <label>." Different plots result in similar distributions.
  • ...and 5 more figures
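
The Grad-CAM selection step in the Figure 1 caption and the 25% chance level in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the heatmap is assumed to be a plain 2D grid of non-negative attention values over a side-by-side image pair with an even width, the tie-breaking rule is arbitrary, and all names and values are hypothetical.

```python
def attended_side(heatmap):
    """Return 'left' or 'right', whichever half has the larger summed attention.

    Assumes an even number of columns; ties go to 'left' (arbitrary choice).
    """
    mid = len(heatmap[0]) // 2
    left = sum(sum(row[:mid]) for row in heatmap)
    right = sum(sum(row[mid:]) for row in heatmap)
    return "left" if left >= right else "right"


def pair_correct(heatmaps, expected_side):
    """True only if every label attends to the side holding its congruent shape."""
    return all(
        attended_side(heatmaps[label]) == side
        for label, side in expected_side.items()
    )


# Hypothetical heatmaps for one image pair: curved shape left, jagged shape right.
heatmaps = {
    "bouba": [[0.9, 0.8, 0.1, 0.0], [0.7, 0.6, 0.2, 0.1]],
    "kiki": [[0.1, 0.2, 0.6, 0.9], [0.0, 0.1, 0.7, 0.8]],
}
expected_side = {"bouba": "left", "kiki": "right"}

is_correct = pair_correct(heatmaps, expected_side)

# If each label picked a side independently and uniformly at random,
# both picks would be right with probability 0.5 * 0.5.
chance_level = 0.5 * 0.5
print(is_correct, chance_level)
```

Requiring both labels to land on the right shape is what places chance at 25% rather than 50% in this sketch, which matches the chance level stated in the Figure 3 caption.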