Kiki or Bouba? Sound Symbolism in Vision-and-Language Models
Morris Alper, Hadar Averbuch-Elor
TL;DR
The paper shows that vision-and-language models encode non-arbitrary sound-symbolic mappings, akin to the kiki-bouba effect, by constructing pseudowords and probing CLIP and Stable Diffusion in a zero-shot setting. Using prompts and a multimodal embedding framework, the authors define geometric and phonetic scores that quantify associations between written sounds and visual properties, reporting robust discriminative metrics and alignment with human perception. The work provides a computational approach to sound symbolism and reveals emergent surface-form knowledge in multimodal encoders. It also discusses implications for cognitive science and interpretability, acknowledges dataset- and language-specific caveats, and proposes avenues for multilingual exploration.
Abstract
Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well-demonstrated with regard to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.
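
As a rough illustration of what zero-shot probing of a text embedding space can look like, the sketch below scores pseudowords along a "sharp versus round" direction. It is not the authors' implementation: the probing direction (difference of mean adjective embeddings), the function names, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a text encoder such as CLIP's.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def probe_direction(sharp_vecs, round_vecs):
    """Unit vector pointing from the 'round' centroid to the 'sharp' centroid.

    Assumption: a single linear direction is used as the probe.
    """
    sharp_mean = mean_vector([normalize(v) for v in sharp_vecs])
    round_mean = mean_vector([normalize(v) for v in round_vecs])
    return normalize([s - r for s, r in zip(sharp_mean, round_mean)])

def geometric_score(word_vec, direction):
    """Cosine similarity between a word embedding and the probe direction.

    Positive values lean 'sharp', negative values lean 'round'.
    """
    return sum(a * b for a, b in zip(normalize(word_vec), direction))

# Hypothetical toy embeddings standing in for real text-encoder outputs.
TOY_EMBEDDINGS = {
    "sharp": [1.0, 0.1, 0.0],
    "spiky": [0.9, 0.0, 0.1],
    "round": [-1.0, 0.1, 0.0],
    "curved": [-0.9, 0.0, 0.1],
    "kiki": [0.7, 0.5, 0.2],
    "bouba": [-0.6, 0.6, 0.1],
}

direction = probe_direction(
    [TOY_EMBEDDINGS["sharp"], TOY_EMBEDDINGS["spiky"]],
    [TOY_EMBEDDINGS["round"], TOY_EMBEDDINGS["curved"]],
)
scores = {
    word: geometric_score(TOY_EMBEDDINGS[word], direction)
    for word in ("kiki", "bouba")
}
```

With these made-up vectors, "kiki" receives a positive score and "bouba" a negative one by construction. The toy data demonstrates only the mechanics of the scoring, not any empirical finding.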
