Measuring Sound Symbolism in Audio-visual Models

Wei-Cheng Tseng; Yi-Jen Shih; David Harwath; Raymond Mooney

TL;DR

Pre-trained audio-visual models, particularly those trained on speech data, produce outputs that correlate significantly with established patterns of sound symbolism, offering insights into both cognitive architectures and machine learning strategies.

Abstract

Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations, a phenomenon known as sound symbolism that is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models' outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
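The abstract does not say which non-parametric test was used, so the following is only an illustrative sketch of what a zero-shot, non-parametric check could look like. It assumes that a model yields one embedding per audio clip and per image, that the association is scored as a difference in mean cosine similarity, and that significance is assessed with a one-sided permutation test. The function names (`sharpness_score`, `permutation_test`) and the toy two-dimensional embeddings are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def sharpness_score(audio_emb, sharp_imgs, round_imgs):
    """Mean similarity to sharp images minus mean similarity to round images.

    Assumption: a positive value means the audio sits closer to sharp shapes.
    """
    to_sharp = mean(cosine(audio_emb, img) for img in sharp_imgs)
    to_round = mean(cosine(audio_emb, img) for img in round_imgs)
    return to_sharp - to_round


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided permutation test of whether mean(group_a) > mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean(pooled[:n_a]) - mean(pooled[n_a:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy embeddings (hypothetical): axis 0 stands for "sharp", axis 1 for "round".
sharp_imgs = [[1.0, 0.1], [0.9, 0.2]]
round_imgs = [[0.1, 1.0], [0.2, 0.9]]
kiki_like = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.1], [0.95, 0.05], [0.85, 0.2]]
bouba_like = [[0.0, 1.0], [0.1, 0.9], [0.1, 0.8], [0.05, 0.95], [0.2, 0.85]]

kiki_scores = [sharpness_score(a, sharp_imgs, round_imgs) for a in kiki_like]
bouba_scores = [sharpness_score(a, sharp_imgs, round_imgs) for a in bouba_like]
p_value = permutation_test(kiki_scores, bouba_scores)
```

No model weights are updated at any point, which is what makes the setting zero-shot: the test only compares scores read from frozen embeddings. A permutation test is used here because it makes no assumption about how those scores are distributed.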

Paper Structure

This paper contains 15 sections, 9 equations, 3 figures, and 2 tables.

Introduction
Related Work
Sound Symbolism
Pre-trained Audio-visual Models
Interpreting Deep Learning Models
Dataset Collection
Generating Images
Synthesizing Audios
Evaluation Method
Experiments
Pre-trained Audio-Visual Models
Quantitative Results
Qualitative Results
Conclusion
Limitations

Figures (3)

Figure 1: Example of the kiki-bouba experiment: When hearing the names "kiki" and "bouba", people from various cultural and linguistic backgrounds typically label the left shape as "kiki" and the right one as "bouba".
Figure 2: Examples of generated images: the upper row is from the sharp image set $\mathcal{I}_{\text{sharp}}$, while the lower row is from the round image set $\mathcal{I}_{\text{round}}$.
Figure 3: Phones sorted by average geometric score, grouped by the first syllable of the sounds. The colors indicate the ground-truth association of each phone (blue and $\circ$ refer to the round group, while red and $\star$ represent the sharp group). Consonants and vowels are displayed on separate scales but are positioned absolutely to each other within each scale.
