Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect
Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef
TL;DR
This work re-evaluates the bouba-kiki cross-modal effect in two CLIP backbones (ResNet-50 and ViT) using human-aligned prompts and a Grad-CAM interpretability approach. By pairing carefully controlled curved vs jagged shapes with diverse pseudowords and adjectives, the authors show that neither backbone consistently exhibits human-like shape–word associations, especially for novel pseudowords. Through Bayesian regression analyses and Grad-CAM, they demonstrate that model predictions largely reflect chance and that attention does not target the expected shape features, highlighting a gap between VLM representations and human cognition. The findings emphasize limitations in current vision–language grounding and invite future work on embodied, cross-linguistic, and interpretable grounding to achieve more human-like cross-modal understanding.
Abstract
Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like 'bouba' with round shapes and 'kiki' with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and a Grad-CAM analysis, used here as a novel approach to interpreting visual attention in shape-word matching tasks. Our findings show that these model variants do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both model variants lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
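To make the prompt-based measure concrete, the following is a minimal sketch of how probabilities over candidate labels could be read as a model preference. It is not the authors' implementation: the function names `match_probabilities` and `congruent_rate`, the trial layout, and the logit values are all hypothetical, and the logits stand in for image-text similarity scores that a CLIP-style model would produce.

```python
import math


def match_probabilities(logits):
    """Softmax over image-text similarity logits, one per candidate label."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def congruent_rate(trials):
    """Fraction of trials in which the most probable label is the expected one.

    Each trial is a tuple (labels, logits, expected_label).
    """
    hits = 0
    for labels, logits, expected in trials:
        probs = match_probabilities(logits)
        chosen = labels[probs.index(max(probs))]
        hits += chosen == expected
    return hits / len(trials)


# Made-up logits for illustration only; these are NOT results from the paper.
trials = [
    (["bouba", "kiki"], [21.3, 20.9], "bouba"),  # curved shape, congruent choice
    (["bouba", "kiki"], [19.8, 20.1], "kiki"),   # jagged shape, congruent choice
    (["bouba", "kiki"], [20.5, 20.2], "kiki"),   # jagged shape, incongruent choice
]
rate = congruent_rate(trials)
```

Under this reading, a rate near 0.5 on a two-label task would correspond to chance-level behaviour, whereas a human-like bouba-kiki effect would push the rate well above it.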
