BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation

Chen Wang, Minpeng Liao, Zhongqiang Huang, Jiajun Zhang

TL;DR

This work tackles extending large language models to spoken language by addressing speech-text alignment and fine-grained lexical mapping. It introduces BLSP-KD, which combines a knowledge-distillation objective over next-token predictions with a CIF-based CFormer adapter to achieve one-to-one speech-text token alignment, plus Partial LoRA to selectively fine-tune the LLM for the speech modality. Empirically, BLSP-KD outperforms the previous end-to-end baseline and comparable cascaded systems on speech translation and general QA, with additional gains from unfreezing the speech encoder and LLM via PLoRA in certain settings; however, gaps remain relative to strong ASR-backed cascaded baselines due to data scale differences. Overall, the approach demonstrates a promising end-to-end pathway for instruction-following with speech and lays groundwork for incorporating paralinguistic cues in future work.
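The knowledge-distillation objective over next-token predictions can be illustrated with a small sketch. This is not the authors' implementation: the function names, the direction of the divergence (teacher relative to student), and the plain averaging over positions are all assumptions made for illustration, and the logits are toy lists rather than LLM outputs.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def kd_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over aligned token positions.

    teacher_logits: per-position logits from the pass on text input.
    student_logits: per-position logits from the pass on speech input.
    Both are assumed (hypothetically) to cover the same n positions.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)
```

The loss is zero when the speech-conditioned distributions match the text-conditioned ones at every position and grows as they diverge, which is what makes it usable as a direct measure of alignment quality.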

Abstract

Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous-integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM fine-tuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems of comparable parameter scale, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
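The continuous-integrate-and-fire step can be sketched as follows. This is a generic CIF routine, not the paper's CFormer: the function name `cif`, the firing threshold of 1.0, and the rescaling of the weights so that they sum to the number of text tokens are assumptions for illustration, and the frames are toy vectors rather than speech-encoder outputs.

```python
def cif(frames, alphas, n_tokens=None, threshold=1.0):
    """Integrate weighted frames and fire one output per `threshold` of weight.

    frames:   list of l feature vectors (lists of floats).
    alphas:   list of l non-negative weights, one per frame.
    n_tokens: if given, alphas are rescaled to sum to n_tokens * threshold,
              so that exactly n_tokens outputs are fired (assumed here to
              mirror training against a known transcript length).
    """
    if n_tokens is not None:
        scale = n_tokens * threshold / sum(alphas)
        alphas = [a * scale for a in alphas]

    dim = len(frames[0])
    tol = 1e-9  # tolerance for floating-point round-off at the boundary
    outputs = []
    acc_weight = 0.0          # weight accumulated since the last firing
    acc_state = [0.0] * dim   # weighted sum of frames since the last firing

    for frame, alpha in zip(frames, alphas):
        remaining = alpha
        # A frame may straddle one or more token boundaries.
        while acc_weight + remaining >= threshold - tol:
            used = threshold - acc_weight  # weight needed to complete a token
            outputs.append([s + used * f for s, f in zip(acc_state, frame)])
            remaining = max(remaining - used, 0.0)
            acc_weight = 0.0
            acc_state = [0.0] * dim
        # Carry the leftover weight of this frame into the next token.
        acc_weight += remaining
        acc_state = [s + remaining * f for s, f in zip(acc_state, frame)]

    return outputs
```

For example, four frames with equal weights and `n_tokens=2` yield two output vectors, each a weighted sum of two adjacent frames, so the output length matches the text length regardless of how many frames the speech encoder produced.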

Paper Structure (23 sections, 8 equations, 3 figures, 6 tables)

This paper contains 23 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An overview of the BLSP-KD approach. The next-token prediction distributions based on text input, as computed by the LLM in the teacher pass on speech transcripts, are used as supervision in the student pass to train the modality adapter (CFormer). This allows the LLM to produce similar next-token prediction distributions for both the input (orange) and response (blue) tokens given speech input. Note that in this figure, the LLM and Speech Encoder are kept frozen, as this is the setting used in most experiments; however, they can also be fine-tuned.
  • Figure 2: An illustration of the CFormer adapter for mapping the speech encoder's feature representation $\mathbf{s}^\text{enc}$ of length $l$ to hidden states $\mathbf{s}^\text{adp}$ of length $n$ as inputs to the LLM. The $\alpha$ values are first computed for each of the $l$ hidden states in $\mathbf{s}^\text{pre}$ and then distributed among the number of text tokens to compute the $n$ hidden states in $\mathbf{s}^\text{cif}$.
  • Figure 3: An illustration of PLoRA. LoRA is applied only to speech tokens, while the encoding of text tokens remains unaffected.
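
The idea in Figure 3, a low-rank update applied only at speech positions, can be sketched for a single linear layer. This is a minimal illustration under stated assumptions, not the authors' code: the names `plora_linear`, `W`, `A`, `B`, and `scaling` are hypothetical, and the matrices are plain nested lists rather than model weights.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def plora_linear(hidden, is_speech, W, A, B, scaling=1.0):
    """Apply a linear layer with a LoRA update gated by token modality.

    hidden:    list of token vectors, each of size d_in.
    is_speech: list of booleans, True where the token comes from speech.
    W:         frozen base weight, shape (d_out, d_in).
    A, B:      low-rank factors, shapes (r, d_in) and (d_out, r).
    Text tokens see only W; speech tokens see W plus scaling * B A.
    """
    outputs = []
    for x, speech in zip(hidden, is_speech):
        y = matvec(W, x)  # base projection, shared by both modalities
        if speech:
            delta = matvec(B, matvec(A, x))  # low-rank correction
            y = [yi + scaling * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs
```

Because text positions bypass the low-rank path entirely, the layer's output for text input is identical to that of the original frozen layer, which is consistent with the caption's statement that the encoding of text tokens remains unaffected.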