Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models

Chi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, Hung-yi Lee

TL;DR

The paper investigates catastrophic forgetting during end-to-end training of Spoken Language Models (SLMs) that adapt pre-trained LLMs to speech via multi-stage tasks (ASR, TTS, SQA). It evaluates three mitigation strategies—model merging, discounting the LoRA scaling factor, and experience replay—with experiments showing that experience replay is the most effective, and that combining it with other strategies yields additional gains. By analyzing forgetting in text-based QA and instruction-following tasks alongside speech-based QA, the work demonstrates the potential of replay-based approaches to preserve prior knowledge while acquiring new speech capabilities. These findings provide actionable guidance for designing more robust and efficient SLM training pipelines across multimodal tasks.

Abstract

End-to-end training of Spoken Language Models (SLMs) commonly involves adapting pre-trained text-based Large Language Models (LLMs) to the speech modality through multi-stage training on diverse tasks such as ASR, TTS, and spoken question answering (SQA). Although this multi-stage continual learning equips LLMs with both speech understanding and generation capabilities, the substantial differences in task and data distributions across stages can lead to catastrophic forgetting, where previously acquired knowledge is lost. This paper investigates catastrophic forgetting and evaluates three mitigation strategies (model merging, discounting the LoRA scaling factor, and experience replay) to balance knowledge retention with new learning. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods. These findings provide insights for developing more robust and efficient SLM training pipelines.
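
To make the three mitigation strategies named in the abstract more concrete, the following is a minimal toy sketch in pure Python. It is not the authors' implementation. It assumes linear interpolation as the merging scheme, a multiplicative `discount` applied to the standard LoRA scale `alpha / rank`, and a fixed `replay_ratio` of earlier-stage samples per batch. All function and parameter names are hypothetical and chosen only for illustration.

```python
import random


def merge_weights(base, tuned, lam=0.5):
    """Model merging (assumed scheme: linear interpolation).

    lam = 0 returns the base weights, lam = 1 the fine-tuned weights.
    """
    return [(1.0 - lam) * b + lam * t for b, t in zip(base, tuned)]


def lora_effective_weight(w0, b, a, alpha, rank, discount=1.0):
    """Effective weight W = W0 + discount * (alpha / rank) * (B @ A).

    discount < 1 shrinks the LoRA update, keeping W closer to W0.
    Matrices are plain nested lists: w0 is (rows x cols),
    b is (rows x rank), a is (rank x cols).
    """
    scale = discount * alpha / rank
    rows, cols = len(w0), len(w0[0])
    return [
        [
            w0[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def replay_batch(current, replay_buffer, batch_size, replay_ratio, rng):
    """Experience replay: fill part of a batch with earlier-stage samples."""
    n_replay = min(int(round(batch_size * replay_ratio)), len(replay_buffer))
    n_current = batch_size - n_replay
    batch = rng.sample(replay_buffer, n_replay) + rng.sample(current, n_current)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    text_data = [("text", i) for i in range(10)]   # stand-in for earlier-stage data
    sqa_data = [("sqa", i) for i in range(10)]     # stand-in for current-stage data
    print(merge_weights([0.0, 2.0], [1.0, 4.0], lam=0.25))
    print(replay_batch(sqa_data, text_data, batch_size=8, replay_ratio=0.25, rng=rng))
```

The three functions are independent, so they can be combined, for example by training with replay batches and then discounting or merging the resulting weights. The specific values of `lam`, `discount`, and `replay_ratio` here are placeholders, not settings reported in the paper.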

Paper Structure

This paper contains 19 sections, 7 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Continual training of a spoken language model using multi-stage speech processing tasks.
  • Figure 2: The architecture of a Spoken Language Model (SLM), which consists of a backbone LLM, a speech encoder that converts speech into speech tokens, and a vocoder that synthesizes the speech tokens into a speech waveform.
  • Figure 3: Evaluation results on instruction-following and question answering. LLaMA, Web, and Trivia denote LLaMA-Questions, Spoken WebQuestions, and Audio Trivia QA. IFEval-P and IFEval-I denote IFEval at the prompt level and the instruction level. w/ R means with experience replay.