Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

Tolúlopé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky, Karen Livescu

TL;DR

The paper investigates how modality adapters (MAs) in spoken language models (SLMs) reshape speech encoder outputs, proposing a model-agnostic token-nearest-neighbor method to categorize MA content as transcription, translation, semantic representation, or transliteration. Applying this approach to SALMONN, Qwen2-Audio, and Phi-4-Multimodal-Instruct across multilingual data, the authors find two dominant strategies: Whisper-based MAs tend to produce an English-based interlingua or semantic representation, while the non-Whisper MA (Phi-4-Multimodal-Instruct) represents the phonetics of the input using English words. Complementary linear probing shows Whisper-based MAs lose low-level phonetic/lexical detail but gain semantic information, whereas Phi-4-MI MAs improve both word/phone accuracy and semantics. The study provides a principled, extensible framework for interpreting MA behavior in SLMs, with implications for designing multilingual speech-language systems and understanding cross-lingual transfer.

Abstract

Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don't, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which strategy arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
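
The nearest-token lookup mentioned in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the distance measure and uses a tiny made-up embedding table; the abstract does not state the metric, and the names `nearest_tokens`, `toy_embeddings`, and `toy_frames` are hypothetical. In a real SLM the table would be the decoder LM's input embedding matrix and the frames would be MA output vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_tokens(ma_frames, embedding_table):
    """Map each MA output vector to the vocabulary token whose embedding is most similar.

    Assumption: similarity is cosine; the paper may use a different measure.
    """
    result = []
    for frame in ma_frames:
        best = max(embedding_table, key=lambda tok: cosine(frame, embedding_table[tok]))
        result.append(best)
    return result


# Toy, hand-written data purely for illustration (not taken from any model).
toy_embeddings = {
    "you": [1.0, 0.0, 0.0],
    "live": [0.0, 1.0, 0.0],
    "vive": [0.0, 0.0, 1.0],
}
toy_frames = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.0, 0.2, 0.7]]

print(nearest_tokens(toy_frames, toy_embeddings))
```

Each MA frame is decoded independently here, so the result is one token per frame; any grouping of tokens by word would rely on a separate alignment step.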

Paper Structure

This paper contains 12 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Generic spoken language model architecture: A spoken language model uses a speech encoder to embed speech, learns soft tokens with a modality adapter and concatenates them with a language model text prompt.
  • Figure 2: A summary of our token-level analysis method up to Step 2. We align words and tokens, translate the transcription and do language identification of the tokens. From top to bottom, the rows correspond to: the alignment of the ground truth transcription to the audio (obtained using the Montreal Forced Aligner); the corresponding aligned translation (obtained with Awesome Align); the sequence of nearest-neighbour LM tokens; and the language identified for each token (by the Google Translate API). The nearest neighbour token sets resulting from this example alignment are given in the walkthrough table (tab:walkthrough).
  • Figure 3: A trace at Step 3d of our token-level analysis method. For each token aligned to a specific word, we use Epitran to obtain a phonetic transcription of the word. If the aligned tokens contain at least half of the phones in the phonetic transcription of the word (in this case, 2 of the 4 phones of "vivez" are present), we determine that they are a phonetic representation of the speech.
  • Figure 4: Token language distribution of SALMONN, Qwen2-Audio, and Phi-4-Multimodal-Instruct.
  • Figure 5: Word verdict of decipherable tokens for SALMONN, Qwen2-Audio, and Phi-4-Multimodal-Instruct.
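
The half-of-the-phones rule described in the Figure 3 caption can be sketched as below. This is an illustrative reading, not the authors' code: it assumes phones are compared as a multiset overlap, takes pre-computed phone lists as input rather than calling Epitran, and the function name `is_phonetic_match` and the token phones shown are hypothetical.

```python
from collections import Counter


def is_phonetic_match(word_phones, token_phones, threshold=0.5):
    """Return True if the aligned tokens cover at least `threshold` of the word's phones.

    Assumption: coverage is a multiset overlap, so a phone that occurs twice in the
    word must occur twice among the tokens to be counted twice.
    """
    if not word_phones:
        return False
    overlap = Counter(word_phones) & Counter(token_phones)
    covered = sum(overlap.values())
    return covered / len(word_phones) >= threshold


# "vivez" has 4 phones; the token phones below are made up for illustration.
vivez_phones = ["v", "i", "v", "e"]
hypothetical_token_phones = ["v", "i"]

print(is_phonetic_match(vivez_phones, hypothetical_token_phones))
```

With 2 of the 4 phones covered, the ratio meets the 0.5 threshold, matching the case described in the caption; a single shared phone would fall below it.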