Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models
Tolúlopé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky, Karen Livescu
TL;DR
The paper investigates how modality adapters (MAs) in spoken language models (SLMs) reshape speech encoder outputs. It proposes a model-agnostic method that maps each MA output to its nearest decoder token and categorizes the result as transcription, translation, semantic representation, or transliteration. Applying this approach to SALMONN, Qwen2-Audio, and Phi-4-Multimodal-Instruct on multilingual data, the authors find two dominant strategies: MAs in Whisper-based models tend to produce an English-based interlingua or semantic representation, while the MA in the non-Whisper model, Phi-4-Multimodal-Instruct, represents the phonetics of the input using English words. Complementary linear probing shows that Whisper-based MAs lose low-level phonetic and lexical detail but gain semantic information, whereas the Phi-4-Multimodal-Instruct MA improves both word/phone accuracy and semantics. The study offers a principled, extensible framework for interpreting MA behavior in SLMs, with implications for designing multilingual speech-language systems and understanding cross-lingual transfer.
Abstract
Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio, and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that do not, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which strategy arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
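
The nearest-token lookup described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the distance measure and uses a toy three-word vocabulary with made-up embeddings; the function name `nearest_tokens`, the similarity measure, and all values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nearest_tokens(ma_outputs, token_embeddings, vocab, k=1):
    """Return the k nearest vocabulary tokens for each MA output vector.

    Assumption: nearness is measured by cosine similarity; the source
    does not specify the metric in this section.
    """
    eps = 1e-12  # guards against division by zero for all-zero vectors
    ma = ma_outputs / (np.linalg.norm(ma_outputs, axis=1, keepdims=True) + eps)
    emb = token_embeddings / (
        np.linalg.norm(token_embeddings, axis=1, keepdims=True) + eps
    )
    sims = ma @ emb.T                        # shape: (num_frames, vocab_size)
    top = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar tokens
    return [[vocab[j] for j in row] for row in top]

# Toy data (hypothetical): three decoder tokens with 3-dimensional embeddings.
vocab = ["dog", "cat", "house"]
token_embeddings = np.eye(3)

# Two hypothetical MA output frames.
ma_outputs = np.array([[0.9, 0.1, 0.0],
                       [0.0, 0.2, 0.8]])

print(nearest_tokens(ma_outputs, token_embeddings, vocab))  # [['dog'], ['house']]
```

In a real SLM, `token_embeddings` would be the decoder LM's input embedding matrix and `ma_outputs` the adapter's output frames for an utterance; the resulting token sequence can then be inspected to judge whether it resembles a transcription, a translation, or a phonetic rendering in English words.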
