Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models

Tsendsuren Munkhdalai, Youzheng Chen, Khe Chai Sim, Fadi Biadsy, Tara Sainath, Pedro Moreno Mengibar

TL;DR

The paper addresses the problem of adapting large pre-trained speech models to many downstream tasks without prohibitive per-task parameter overhead. It proposes the Hierarchical Recurrent Adapter (HRA), which couples a single shared IndRNN-based controller with per-task adapter heads. Both the controller and the heads are reused across the depth of the model, which drastically reduces task-specific parameters. Two head architectures are explored, Linear and Feed-Forward, and the adapter outputs are added residually to backbone activations to form task-specific representations. Empirical results on automatic speech recognition show that HRA reduces parameter requirements by factors of 2–8 and achieves competitive or improved WER compared with full fine-tuning, with a total parameter count of about 12.8M, enabling scalable multi-task adaptation. The approach offers a practical, modular path to efficient multi-task ASR.

Abstract

Parameter-efficient adaptation methods have become a key mechanism for training large pre-trained models on downstream tasks. However, their per-task parameter overhead is still considered high when the number of downstream tasks to adapt for is large. We introduce an adapter module that is more efficient in large-scale multi-task adaptation scenarios. Our adapter is hierarchical in terms of how the adapter parameters are allocated. The adapter consists of a single shared controller network and multiple task-level adapter heads to reduce the per-task parameter overhead without performance regression on downstream tasks. The adapter is also recurrent, so all adapter parameters are reused across different layers of the pre-trained model. Our Hierarchical Recurrent Adapter (HRA) outperforms previous adapter-based approaches as well as the full model fine-tuning baseline in both single-task and multi-task adaptation settings when evaluated on automatic speech recognition tasks.
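
The hierarchical allocation described in the abstract can be illustrated with a small parameter-counting sketch. Everything here is an assumption made for illustration: the function names, the conventional bottleneck adapter used as the point of comparison, the inclusion of bias terms, and the dimensions in the usage lines are hypothetical and are not taken from the paper. The sketch only shows why a shared controller plus small per-task heads grows more slowly with the number of tasks than a separate adapter per task and per layer.

```python
def bottleneck_adapter_params(d_model, d_bottleneck):
    """Parameters of one conventional bottleneck adapter (down- and up-projection with biases)."""
    down = d_model * d_bottleneck + d_bottleneck
    up = d_bottleneck * d_model + d_model
    return down + up


def per_layer_adapters_total(n_tasks, n_layers, d_model, d_bottleneck):
    """Total adapter parameters when every task gets its own adapter in every layer."""
    return n_tasks * n_layers * bottleneck_adapter_params(d_model, d_bottleneck)


def hra_style_total(n_tasks, d_model, d_hidden):
    """Total for one shared IndRNN-style controller plus one linear head per task.

    Assumed shapes (hypothetical): the controller has an input matrix W of size
    d_hidden x d_model, an element-wise recurrent vector u and a bias b, both of
    size d_hidden. Each linear head is d_model x d_hidden plus a bias.
    Nothing is multiplied by the number of layers, because the same weights are
    reused at every layer.
    """
    controller = d_hidden * d_model + d_hidden + d_hidden
    linear_head = d_model * d_hidden + d_model
    return controller + n_tasks * linear_head


# Hypothetical dimensions, chosen only to make the comparison concrete.
conventional = per_layer_adapters_total(n_tasks=10, n_layers=24, d_model=1024, d_bottleneck=64)
shared = hra_style_total(n_tasks=10, d_model=1024, d_hidden=64)
print(conventional, shared)
```

Under these assumptions, adding one more task costs a single head in the shared design, whereas the per-layer design pays for a full set of adapters across all layers.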

Paper Structure (7 sections, 4 equations, 1 figure)

Figures (1)

  • Figure 1: Hierarchical Recurrent Adapter (HRA). The yellow box indicates the layers of the underlying backbone speech model. The HRA consists of a single recurrent controller and multiple task-level adapter heads. The output of the adapter head is added to the backbone feature to adapt the model to downstream speech tasks. In HRA, the adapter heads and the recurrent controller weights are shared across all layers, keeping the adapter parameter overhead minimal.
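
The data flow in Figure 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The class and function names are hypothetical, and the controller uses the standard IndRNN update (a linear input projection plus an element-wise recurrent weight, followed by ReLU) with the recurrence running over backbone layers. Only the Linear head variant is shown, and the backbone layers are placeholder callables standing in for a frozen speech model.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class IndRNNController:
    """Shared controller: h_l = relu(W x_l + u * h_{l-1} + b), with u applied element-wise."""

    def __init__(self, d_model, d_hidden, rng):
        self.W = rand_matrix(d_hidden, d_model, rng)
        self.u = [rng.uniform(-0.1, 0.1) for _ in range(d_hidden)]
        self.b = [0.0] * d_hidden

    def step(self, x, h_prev):
        wx = matvec(self.W, x)
        return [max(0.0, a + u * h + b) for a, u, h, b in zip(wx, self.u, h_prev, self.b)]


class LinearHead:
    """Task-level head: projects the controller state back to the backbone feature size."""

    def __init__(self, d_hidden, d_model, rng):
        self.P = rand_matrix(d_model, d_hidden, rng)

    def __call__(self, h):
        return matvec(self.P, h)


def hra_forward(x, backbone_layers, controller, head):
    """Run the backbone, applying the same controller and head after every layer."""
    h = [0.0] * len(controller.u)  # initial controller state
    for layer in backbone_layers:
        x = layer(x)                                 # frozen backbone layer
        h = controller.step(x, h)                    # controller weights reused at every layer
        x = [xi + oi for xi, oi in zip(x, head(h))]  # residual addition of the head output
    return x


rng = random.Random(0)
controller = IndRNNController(d_model=4, d_hidden=3, rng=rng)
heads = {"task_a": LinearHead(3, 4, rng), "task_b": LinearHead(3, 4, rng)}
layers = [lambda v: v, lambda v: v]  # placeholder backbone layers
features = [1.0, -2.0, 0.5, 3.0]
outputs = {name: hra_forward(features, layers, controller, head) for name, head in heads.items()}
```

Switching tasks in this sketch means swapping only the head. The controller and the backbone layers stay the same, which mirrors the sharing scheme the caption describes.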