Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing

Viet Anh Trinh; Rosy Southwell; Yiwen Guan; Xinlu He; Zhiyong Wang; Jacob Whitehill

TL;DR

This work addresses the challenge of unifying speech, text, and vision processing within a single discrete-token framework built on a pretrained large language model. It introduces DMLM, a decoder-only Transformer that extends the LLM vocabulary with modality-specific tokens ($\mathcal{T}$, $\mathcal{S}$, $\mathcal{I}$) and is trained with a length-normalized tri-modal loss $\mathcal{L}$ that balances token counts across modalities, enabling ASR, T2S, S2TT, and I2T. Key findings: mixed supervision that combines supervised and unsupervised data improves performance across tasks and datasets; for ASR, initializing DMLM from a pretrained LLM helps, and Whisper-based codebooks yield more effective discrete units than XLS-R- or Seamless-based ones. The results suggest a scalable path toward cross-modal translation with strong linguistic priors and point to future directions in audiovisual processing and token-codebook design.
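As a rough illustration of what length normalization buys, the sketch below averages the negative log-likelihood within each modality before summing across modalities. This is only one plausible reading of "length-normalized": the paper's actual equation is not reproduced on this page, and the function name `length_normalized_loss` and the modality labels are assumptions made for the example.

```python
import math
from collections import defaultdict


def length_normalized_loss(token_log_probs, modalities):
    """Sum of per-modality mean negative log-likelihoods (assumed form, not the paper's equation).

    token_log_probs: log-probability the model assigned to each target token.
    modalities: parallel list of labels such as "text", "speech", "image".
    """
    if len(token_log_probs) != len(modalities):
        raise ValueError("need exactly one modality label per token")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for log_prob, modality in zip(token_log_probs, modalities):
        totals[modality] -= log_prob
        counts[modality] += 1
    # Dividing by the per-modality token count keeps long speech
    # sequences from dominating much shorter text sequences.
    return sum(totals[m] / counts[m] for m in totals)


# Toy example: 2 text tokens and 6 speech tokens.
log_probs = [math.log(0.5)] * 2 + [math.log(0.25)] * 6
labels = ["text"] * 2 + ["speech"] * 6
loss = length_normalized_loss(log_probs, labels)
unnormalized = -sum(log_probs)
```

In this toy case the speech tokens contribute six of the eight terms to the unnormalized sum, whereas after normalization each modality contributes exactly one averaged term.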

Abstract

Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text-to-speech, and speech-to-speech translation. Moreover, large language models (LLMs) pretrained on vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multimodal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM and from a codebook derived from Whisper activations.

Paper Structure (9 sections, 1 equation, 2 figures, 8 tables)

Introduction
Related Work
Discrete Multimodal Language Model
Experiments
Experiment 1: Tri-Modal Loss Function
Experiment 2: Pretraining the LLM
Experiment 3: Mixed-Supervision Training
Experiment 4: Codebook
Discussion and Conclusions

Figures (2)

Figure 1: Discrete Multimodal Language Model (DMLM): The input (text, speech, image) is tokenized by a modality-specific tokenizer/codec. The discrete token sequence ${\bf x}_1,{\bf x}_2,\ldots$ is concatenated to a task description and then passed to DMLM for processing. Training is conducted based on next-token prediction. Inference can be performed for a variety of tasks (ASR, S2TT, T2S, I2T, etc.).
Figure 2: Example inputs/outputs of a trained DMLM.
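
The pipeline in the Figure 1 caption can be sketched as follows: modality-specific token ids are shifted into one shared vocabulary, joined with a task description, and split into next-token prediction pairs. Everything concrete here is hypothetical and chosen for illustration: the vocabulary sizes, the ordering of the shared vocabulary, the helper names `to_shared_ids` and `build_example`, and the choice to place the task description first.

```python
# Hypothetical vocabulary layout: text ids first, then speech units, then image codes.
TEXT_VOCAB = 1000    # assumed size of the LLM's original text vocabulary
SPEECH_UNITS = 500   # assumed speech codebook size
IMAGE_CODES = 200    # assumed image codebook size

SIZES = {"text": TEXT_VOCAB, "speech": SPEECH_UNITS, "image": IMAGE_CODES}
OFFSETS = {"text": 0, "speech": TEXT_VOCAB, "image": TEXT_VOCAB + SPEECH_UNITS}


def to_shared_ids(local_ids, modality):
    """Map tokenizer-local ids into the shared (extended) vocabulary."""
    size = SIZES[modality]
    for token_id in local_ids:
        if not 0 <= token_id < size:
            raise ValueError(f"id {token_id} is out of range for {modality}")
    return [OFFSETS[modality] + token_id for token_id in local_ids]


def build_example(task_ids, input_ids, input_modality, target_ids, target_modality):
    """Concatenate task description, input, and target; return next-token pairs."""
    sequence = (
        to_shared_ids(task_ids, "text")
        + to_shared_ids(input_ids, input_modality)
        + to_shared_ids(target_ids, target_modality)
    )
    # Next-token prediction: the model sees sequence[:t] and predicts sequence[t].
    return sequence[:-1], sequence[1:]


# ASR-style toy example: speech units in, text tokens out.
inputs, targets = build_example(
    task_ids=[7, 8],
    input_ids=[3, 4, 5],
    input_modality="speech",
    target_ids=[42, 43],
    target_modality="text",
)
```

Swapping the input and target modalities in the same call gives a T2S-style example, which is the sense in which one decoder-only model can serve several tasks.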
