Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems

Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng

TL;DR

Four optimizations on the auto-regressive decoder of the MASR model are proposed, which reduce the Average Word Error Rate (AWER) of pretrained languages from 14.2% to 12.4% compared with Experience Replay, without compromising the AWER of new languages.

Abstract

Continual Learning (CL) involves fine-tuning pre-trained models with new data while maintaining the performance on the pre-trained data. This is particularly relevant for expanding multilingual ASR (MASR) capabilities. However, existing CL methods, mainly designed for computer vision and reinforcement learning tasks, often yield sub-optimal results when directly applied to MASR. We hypothesise that this is because CL of the auto-regressive decoder in the MASR model is difficult. To verify this, we propose four optimizations on the decoder. They include decoder-layer gradient surgery, freezing unused token embeddings, suppressing output of newly added tokens, and learning rate re-scaling. Our experiments on adapting Whisper to 10 unseen languages from the Common Voice dataset demonstrate that these optimizations reduce the Average Word Error Rate (AWER) of pretrained languages from 14.2% to 12.4% compared with Experience Replay, without compromising the AWER of new languages.
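
As a rough illustration of what "suppressing output of newly added tokens" can look like at decoding time, the sketch below masks the logits of a chosen set of token IDs before a greedy pick. This is a generic logit-masking pattern, not the authors' implementation: the function names, the toy vocabulary, and the choice of which IDs count as "newly added" are all assumptions made for the example.

```python
import math


def suppress_tokens(logits, suppressed_ids):
    """Return a copy of `logits` with the given token IDs masked to -inf."""
    masked = list(logits)
    for token_id in suppressed_ids:
        masked[token_id] = -math.inf
    return masked


def greedy_pick(logits):
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary (hypothetical): IDs 0-2 stand for pre-trained tokens,
# IDs 3-4 stand for tokens added for the new languages.
logits = [1.2, 0.3, 2.0, 2.5, 0.1]
new_token_ids = {3, 4}

print(greedy_pick(logits))                                  # 3 (a new token wins)
print(greedy_pick(suppress_tokens(logits, new_token_ids)))  # 2 (best pre-trained token)
```

Masking to negative infinity guarantees that a suppressed token can never be selected, whatever its original score, while the relative ranking of the remaining tokens is left unchanged.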

Paper Structure

This paper contains 14 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of decoder-layer gradient surgery. For an MASR model that consists of an encoder (left) and decoder (right), we freeze the encoder and only adapt the decoder side. Also, we propose to apply gradient surgery only to the decoder layers, but not to the learnt token embeddings and learnt positional embeddings (PE).
  • Figure 2: Strategies for adapting the decoder's token embeddings to new tasks. All embeddings are initialized from pre-trained weights. A) All embeddings are shared and updated for both the old and new languages. B) A copy of the embeddings is adapted for the new language, and the original embeddings are kept for the old languages. C) All embeddings are shared, but only the special tokens and the tokens used by the new language are adapted.
  • Figure 3: Example of Language ID Token Suppression (LID TS). A) Reference. B) Hypothesis before LID TS. C) Hypothesis after LID TS.
  • Figure 4: Change in learning rate (LR) during training for different validation intervals. split-$n$ denotes validating every $1/n$ of an epoch.
  • Figure 5: WER when transcribing 10 pre-trained languages and 10 new languages of varying difficulty. Results are obtained without manually specifying the language to transcribe. "Unadapted" denotes the pre-trained model before any adaptation.
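
The idea in Figure 1, applying gradient surgery to the decoder layers while leaving the token and positional embeddings untouched, can be sketched in a few lines. The example below assumes a PCGrad-style projection (dropping the component of the new-task gradient that conflicts with the old-task gradient), which this summary does not confirm as the authors' exact rule. The parameter names, the `decoder.layers.` prefix, and the per-parameter projection are likewise hypothetical choices for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflict(g_new, g_old):
    """If g_new conflicts with g_old (negative dot product), remove the
    component of g_new that points along g_old; otherwise keep g_new."""
    d = dot(g_new, g_old)
    if d >= 0:
        return list(g_new)
    scale = d / dot(g_old, g_old)
    return [a - scale * b for a, b in zip(g_new, g_old)]


def decoder_layer_surgery(grads_new, grads_old, prefix="decoder.layers."):
    """Apply the projection only to parameters whose name starts with
    `prefix`; all other gradients (e.g. embeddings) pass through as-is."""
    out = {}
    for name, g in grads_new.items():
        if name.startswith(prefix):
            out[name] = project_conflict(g, grads_old[name])
        else:
            out[name] = list(g)
    return out


# Toy flattened gradients (hypothetical parameter names and values).
grads_new = {
    "decoder.layers.0.weight": [1.0, -1.0],
    "decoder.embed_tokens.weight": [1.0, -1.0],
}
grads_old = {
    "decoder.layers.0.weight": [0.0, 1.0],
    "decoder.embed_tokens.weight": [0.0, 1.0],
}

result = decoder_layer_surgery(grads_new, grads_old)
print(result["decoder.layers.0.weight"])      # [1.0, 0.0]
print(result["decoder.embed_tokens.weight"])  # [1.0, -1.0]
```

In the toy run, both parameters have a new-task gradient that conflicts with the old-task gradient, but only the decoder-layer gradient is projected; the embedding gradient is passed through unchanged, mirroring the split described in the caption.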