Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms
Tyler Loakman, Joseph James, Chenghua Lin
TL;DR
This work probes whether vision-language models (VLMs) can interpret visualisations of speech by introducing a phonetically informed benchmark for spectrogram and waveform interpretation. It constructs a dataset of 4k+ words with phonemic and graphemic multiple-choice tasks, varying the input modality (spectrogram vs spectrogram+waveform) and selecting distractors via phonemic edit distance. Both zero-shot and finetuned models perform near chance, while automatic speech recognition dramatically outperforms VLMs on the same data, and human phoneticians achieve higher accuracy, although a gap still remains. The study highlights the need for explicit phonetic knowledge within multimodal models and provides a rigorous framework for future work on integrating phoneme-level priors with visual inputs.
Abstract
With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
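To make the distractor selection concrete, the following is a minimal sketch of choosing 3 distractors by phonemic edit distance. It assumes a plain Levenshtein distance over phoneme symbols and a nearest-first selection rule; the abstract states only that distractors are selected based on phonemic edit distance, so the exact rule may differ. The names `phoneme_edit_distance`, `pick_distractors`, and `TOY_LEXICON` are hypothetical and the lexicon is an illustrative toy, not data from the benchmark.

```python
from typing import Dict, List, Sequence


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where each unit is one phoneme symbol."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, pa in enumerate(a, start=1):
        curr = [i]  # deleting all i phonemes of a's prefix
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def pick_distractors(target: str,
                     lexicon: Dict[str, List[str]],
                     k: int = 3) -> List[str]:
    """Return the k words closest to the target in phoneme space.

    Assumption: nearest-first selection with alphabetical tie-breaking.
    Homophones of the target are skipped so that only one option is correct.
    """
    target_phones = lexicon[target]
    scored = sorted(
        (phoneme_edit_distance(target_phones, phones), word)
        for word, phones in lexicon.items()
        if word != target and phones != target_phones
    )
    return [word for _, word in scored[:k]]


# Hypothetical toy lexicon mapping graphemic forms to phoneme sequences.
TOY_LEXICON = {
    "cat": ["k", "æ", "t"],
    "cap": ["k", "æ", "p"],
    "bat": ["b", "æ", "t"],
    "cut": ["k", "ʌ", "t"],
    "dog": ["d", "ɒ", "g"],
}

options = ["cat"] + pick_distractors("cat", TOY_LEXICON)
```

With one correct transcription and 3 distractors, each item has 4 options, so uniform guessing yields 25% accuracy; this is the chance level against which the reported results are compared.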
