Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

Tyler Loakman, Joseph James, Chenghua Lin

TL;DR

This work probes whether vision-language models (VLMs) can interpret visualisations of speech by introducing a phonetically informed benchmark for spectrogram and waveform interpretation. It constructs a dataset of 4k+ words with phonemic and graphemic multiple-choice tasks, varies the input figure (spectrogram vs. spectrogram + waveform), and selects distractors by their phonemic edit distance to the correct answer. Both zero-shot and finetuned models perform near chance, while automatic speech recognition dramatically outperforms VLMs on the same data; human phoneticians reach higher accuracy, though a gap still remains. The study highlights the need for explicit phonetic knowledge within multimodal models and provides a framework for future work on integrating phoneme-level priors with visual inputs.

Abstract

With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
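The distractor selection described in the abstract can be illustrated with a short sketch. It is not the authors' implementation: it assumes a plain Levenshtein distance with unit costs over phoneme symbols, and it assumes the three distractors are simply the nearest words in the lexicon. The function names `phonemic_edit_distance` and `pick_distractors` and the toy lexicon are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def phonemic_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over phoneme symbols (unit costs are an assumption)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_distractors(target: str,
                     lexicon: Dict[str, List[str]],
                     k: int = 3) -> List[Tuple[str, int]]:
    """Return the k words closest to `target` by phonemic edit distance.

    Choosing the nearest neighbours is an assumption; the source only says
    distractors are selected based on phonemic edit distance.
    """
    target_phones = lexicon[target]
    scored = [
        (phonemic_edit_distance(target_phones, phones), word)
        for word, phones in lexicon.items()
        if word != target and phones != target_phones  # skip homophones
    ]
    scored.sort()  # by distance, then alphabetically for determinism
    return [(word, dist) for dist, word in scored[:k]]


# Toy lexicon with hand-written IPA segmentations (illustrative only).
LEXICON = {
    "cat": ["k", "æ", "t"],
    "cap": ["k", "æ", "p"],
    "bat": ["b", "æ", "t"],
    "cut": ["k", "ʌ", "t"],
    "cast": ["k", "ɑː", "s", "t"],
    "dog": ["d", "ɒ", "ɡ"],
}

if __name__ == "__main__":
    print(pick_distractors("cat", LEXICON))
```

Treating each phoneme as a single list element, rather than comparing raw IPA strings character by character, keeps multi-character symbols such as long vowels from being counted as two edits.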

Paper Structure

This paper contains 21 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: An example waveform (top) and spectrogram (bottom) of "activation" spoken by a text-to-speech model.
  • Figure 2: Zero-shot results in the multiple-choice spectrogram interpretation task. Graphemic refers to questions where the options were presented in their standard written English form, whilst Phonemic refers to questions where the options were written in the International Phonetic Alphabet. Spectrogram and Spectrogram + Waveform refer to the type of figure presented to the VLM. Accuracy is the percentage of questions for which the correct answer was selected, and Phonemic Edit Distance is the average distance between the selected option and the correct answer. The solid horizontal line in the Accuracy plot marks chance-level accuracy (25%). In the phonemic distance plot, the dashed lines mark the expected phonemic distance for consistently selecting the 2nd, 3rd or 4th most similar option, and the solid blue line marks the distance expected from random selection.
  • Figure 3: Finetuned results in the multiple-choice spectrogram/waveform interpretation task. Please refer to Figure 2 for axis/condition information.
  • Figure 4: Distribution of word lengths (as determined via phoneme count) and individual phonemes across the training, development and test sets for finetuned VLMs.
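The reference lines described in the Figure 2 caption can be reproduced from the per-question distractor distances. The sketch below is one reading of that caption, not the authors' code: it assumes four options per question, treats the correct answer as the most similar option (distance 0), and averages over questions. The function name `reference_lines` and the example distances are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def reference_lines(distractor_distances: List[List[int]]) -> Dict[str, float]:
    """Compute chance accuracy and expected edit distances for fixed strategies.

    `distractor_distances[q]` holds the phonemic edit distance of each
    distractor in question q to the correct answer.
    """
    # Rank all options per question; the correct answer has distance 0.
    ranked = [sorted([0] + list(d)) for d in distractor_distances]
    n_options = len(ranked[0])

    lines: Dict[str, float] = {"chance_accuracy": 1.0 / n_options}
    # Always picking the 2nd, 3rd, ... most similar option (dashed lines).
    for rank in range(1, n_options):
        lines[f"rank_{rank + 1}"] = mean(q[rank] for q in ranked)
    # Uniform random choice among all options (solid line).
    lines["random_selection"] = mean(mean(q) for q in ranked)
    return lines


if __name__ == "__main__":
    # Two made-up questions with three distractors each.
    print(reference_lines([[1, 2, 3], [1, 1, 2]]))
```

A model whose average phonemic edit distance sits at or above the random-selection line is choosing no better than a uniform guess, even when its raw accuracy fluctuates around 25%.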