Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
Marek Kubis, Paweł Skórzewski, Iwona Christop, Mateusz Czyżnikiewicz, Jakub Kubiak, Łukasz Bondaruk, Marcin Lewandowski
TL;DR
Preserving language understanding when LLMs are accessed via speech is a critical but under-explored problem. The authors introduce C3T, the Cross-modal Capabilities Conservation Test, which uses a voice-cloning TTS engine to synthesize diverse speakers and present textual tasks in spoken form, with a focus on fairness across speakers and robustness across modalities. Using a curated subset of Open LLM Leaderboard v2 and BIG-Bench-Hard tasks suitable for spoken delivery, together with speaker-aware metrics, they show that speech input can cause notable accuracy drops and uneven performance across demographic groups, even for strong models. The work provides a scalable framework for evaluating speech-aware LLMs, highlights cross-modal inconsistencies, and underscores the need for fairness-aware evaluation in multimodal NLP.
Abstract
The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark uses textual tasks and a voice-cloning text-to-speech model to quantify the extent to which language understanding capabilities are preserved when the model is accessed via speech input. C3T also measures the fairness of the model across different categories of speakers and its robustness across the text and speech modalities.
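To make the two quantities named in the abstract more concrete, the following is a minimal sketch of how fairness across speaker categories and robustness across modalities could be scored. It is an illustration under stated assumptions, not the metric definitions used by C3T: the `Result` record, the accuracy-gap notion of fairness, and the agreement-rate notion of robustness are all hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """One model answer (hypothetical record layout, not from the paper)."""
    task_id: str
    speaker_group: Optional[str]  # None marks the text-only run of the task
    correct: bool


def accuracy(results):
    """Fraction of correct answers; 0.0 for an empty collection."""
    results = list(results)
    return sum(r.correct for r in results) / len(results) if results else 0.0


def group_accuracies(results):
    """Accuracy of the spoken runs, split by speaker category."""
    by_group = defaultdict(list)
    for r in results:
        if r.speaker_group is not None:
            by_group[r.speaker_group].append(r)
    return {group: accuracy(rs) for group, rs in by_group.items()}


def fairness_gap(results):
    """Assumed fairness proxy: best-group accuracy minus worst-group accuracy.

    A gap of 0.0 means every speaker category is served equally well.
    """
    accs = group_accuracies(results)
    return max(accs.values()) - min(accs.values()) if accs else 0.0


def cross_modal_consistency(results):
    """Assumed robustness proxy: share of spoken runs whose correctness
    matches the text run of the same task."""
    text = {r.task_id: r.correct for r in results if r.speaker_group is None}
    speech = [
        r for r in results
        if r.speaker_group is not None and r.task_id in text
    ]
    if not speech:
        return 0.0
    return sum(r.correct == text[r.task_id] for r in speech) / len(speech)


# Toy data: two tasks, one text run and two synthetic speaker groups each.
demo = [
    Result("t1", None, True),
    Result("t1", "group_a", True),
    Result("t1", "group_b", False),
    Result("t2", None, True),
    Result("t2", "group_a", True),
    Result("t2", "group_b", True),
]
```

On the toy data, `fairness_gap(demo)` is 0.5 and `cross_modal_consistency(demo)` is 0.75. Other formulations are equally plausible, for example counting a task as handled fairly only when the answer is correct for every speaker.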
