Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing
Viet Anh Trinh, Rosy Southwell, Yiwen Guan, Xinlu He, Zhiyong Wang, Jacob Whitehill
TL;DR
This work addresses the challenge of unifying speech, text, and vision processing within a single discrete-token framework by leveraging pretrained large language models. It introduces DMLM, a decoder-only Transformer that extends the LLM vocabulary with modality-specific tokens ($\mathcal{T}$, $\mathcal{S}$, $\mathcal{I}$) and trains via a length-normalized tri-modal loss $\mathcal{L}$ to balance token counts across modalities, enabling ASR, T2S, S2TT, and I2T. Key findings show that pretraining the LLM and employing mixed supervision with unsupervised data improve performance across tasks, and that Whisper-based codebooks yield more effective discrete units for ASR than XLS-R or Seamless-based options. The results suggest a scalable path toward cross-modal translation with strong linguistic priors and point to future directions in audiovisual processing and token-codebook design.
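To make the idea of a length-normalized loss concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of "length-normalized": the negative log-likelihood is averaged separately over the tokens of each modality, and the per-modality averages are then summed, so that a long speech-token sequence does not dominate a short text sequence. The function name `length_normalized_loss` and the modality labels `"text"` and `"speech"` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def length_normalized_loss(token_log_probs, token_modalities):
    """Sum, over modalities, of the mean negative log-likelihood per token.

    Assumption (not taken from the paper): each modality's loss is divided
    by that modality's own token count before the modalities are combined.
    """
    if len(token_log_probs) != len(token_modalities):
        raise ValueError("each token needs exactly one modality label")

    nll_sum = defaultdict(float)  # summed negative log-likelihood per modality
    count = defaultdict(int)      # number of tokens per modality

    for log_prob, modality in zip(token_log_probs, token_modalities):
        nll_sum[modality] -= log_prob
        count[modality] += 1

    # Each modality contributes its average token loss, regardless of length.
    return sum(nll_sum[m] / count[m] for m in nll_sum)


# Toy sequence: 2 text tokens followed by 6 speech tokens.
log_probs = [math.log(0.5)] * 2 + [math.log(0.25)] * 6
modalities = ["text"] * 2 + ["speech"] * 6

normalized = length_normalized_loss(log_probs, modalities)

# Plain per-token average for comparison; the speech tokens dominate it.
unnormalized = -sum(log_probs) / len(log_probs)
```

In this toy case the text tokens contribute an average loss of ln 2 and the speech tokens an average of ln 4, so the normalized loss weighs the two modalities equally even though speech has three times as many tokens. The plain per-token average is instead pulled toward the speech loss.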
Abstract
Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text-to-speech, and speech-to-speech translation. Moreover, large language models (LLMs) pretrained on vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multimodal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.
