Towards continually learning new languages

Ngoc-Quan Pham, Jan Niehues, Alexander Waibel

TL;DR

Multilingual ASR systems are usually trained with all languages available up front; adding new languages afterwards risks catastrophic forgetting. The authors propose a continual-learning strategy that combines weight factorization with elastic weight consolidation to separate shared and language-specific capacity and protect prior knowledge. They show that starting from 10 languages and extending to 26 can achieve near joint-training performance for new languages (about 4% WER) with only modest degradation of earlier languages (about 9% WER), outperforming naive fine-tuning. The results indicate a practical path to scalable, cost-efficient multilingual ASR, with capacity-driven limitations that may be mitigated by distillation or further architectural refinements.

Abstract

Multilingual speech recognition with neural networks is often implemented with batch learning, where all of the languages are available before training. The ability to add new languages after earlier training sessions can be economically beneficial, but the main challenge is catastrophic forgetting. In this work, we combine the qualities of weight factorization and elastic weight consolidation to counter catastrophic forgetting and to facilitate learning new languages quickly. In experiments that extend an initial set of 10 languages to 26, this combination eliminated catastrophic forgetting while achieving performance on the new languages comparable to having all languages available at once, and reasonable performance compared to training on all languages from scratch.
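
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The elastic weight consolidation term uses the standard quadratic form, (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2, where F_i is a per-parameter importance estimate. The factorized weight assumes a rank-1 multiplicative plus rank-1 additive form per language; that exact form, and all function and parameter names (`ewc_penalty`, `factorized_weight`, `total_loss`, `lam`), are assumptions made for illustration and are not stated in the text above.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def ewc_penalty(theta: Vector, theta_star: Vector, fisher: Vector, lam: float) -> float:
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta      -- current shared parameters (flattened)
    theta_star -- shared parameters saved after the previous training session
    fisher     -- per-parameter importance (diagonal Fisher estimate)
    """
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta_star, fisher)
    )


def factorized_weight(
    shared: Matrix, r_mul: Vector, s_mul: Vector, r_add: Vector, s_add: Vector
) -> Matrix:
    """Assumed illustrative factorization for one language:

        W_lang = shared * (r_mul s_mul^T) + r_add s_add^T

    where * is elementwise. Only the four small vectors are language-specific.
    """
    rows, cols = len(shared), len(shared[0])
    return [
        [shared[i][j] * r_mul[i] * s_mul[j] + r_add[i] * s_add[j] for j in range(cols)]
        for i in range(rows)
    ]


def total_loss(
    task_loss: float, theta: Vector, theta_star: Vector, fisher: Vector, lam: float = 1.0
) -> float:
    """Loss on the new languages plus the penalty that anchors the shared weights."""
    return task_loss + ewc_penalty(theta, theta_star, fisher, lam)
```

Under these assumptions, a new language only adds the small vectors `r_mul`, `s_mul`, `r_add`, and `s_add`, while the penalty discourages the shared matrix from moving along directions that the importance estimate marks as relevant to earlier languages.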

Paper Structure (11 sections, 5 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Comparison between different approaches: weight factorization (WF) with frozen, fine-tuned, or elastic shared weights; elastic weight consolidation (EWC); and simple fine-tuning (Vanilla). The reported value is the average word error rate (WER) over the languages in the set.