PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques

Tzu-Han Lin, How-Shing Wang, Hao-Yung Weng, Kuang-Chen Peng, Zih-Ching Chen, Hung-yi Lee

TL;DR

Parameter-Efficient Fine-Tuning (PEFT) is explored for SSL speech models to reduce compute and storage during fine-tuning. The authors compare DARTS-based layer placement, a Hybrid method that merges adapters within layers, and ensemble strategies to fuse outputs from diverse PEFTs. They find that differentiable architecture search for PEFT placement does not beat a simple all-layer PEFT baseline, while ensemble majority voting yields the best performance, and that PEFT methods encode complementary information, especially for CTC-based tasks requiring alignment handling. The work provides practical guidance for deploying PEFT in speech applications under fixed parameter budgets and highlights the value of ensemble designs to leverage diverse adapter representations.

Abstract

Parameter-Efficient Fine-Tuning (PEFT) is increasingly recognized as an effective method in speech processing. However, the optimal approach and the placement of PEFT methods remain inconclusive. Our study conducts extensive experiments to compare different PEFT methods and their layer-wise placement by adapting Differentiable Architecture Search (DARTS). We also explore the use of ensemble learning to leverage diverse PEFT strategies. The results reveal that DARTS does not outperform the baseline approach, which inserts the same PEFT method into all layers of a Self-Supervised Learning (SSL) model. In contrast, an ensemble learning approach, particularly one employing majority voting, demonstrates superior performance. Our statistical evidence indicates that different PEFT methods learn in varied ways. This variation might explain why combining various PEFT methods through ensemble learning harnesses their unique learning capabilities more effectively than individual layer-wise optimization.
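To make the majority-voting idea concrete, the following is a minimal sketch for an utterance-level classification task, assuming each separately fine-tuned PEFT model has already produced one predicted label per utterance. The function name `majority_vote` and the tie-breaking rule (prefer the earliest-listed model) are assumptions for illustration; the paper's exact voting procedure is not given in this summary, and sequence outputs such as CTC transcripts would need alignment before any voting could apply.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label predictions by majority vote.

    predictions: list of lists; predictions[m][i] is the label that
    model m assigns to utterance i. All inner lists must have the
    same length. (Hypothetical helper, not the authors' code.)
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all models must predict the same number of items")

    voted = []
    for labels in zip(*predictions):  # labels from every model for one item
        counts = Counter(labels)
        top = max(counts.values())
        # Assumed tie-break: the earliest-listed model whose label
        # reached the top count wins.
        voted.append(next(label for label in labels if counts[label] == top))
    return voted


# Three hypothetical PEFT models voting on three utterances.
model_a = [1, 2, 3]
model_b = [1, 0, 3]
model_c = [2, 0, 4]
ensemble = majority_vote([model_a, model_b, model_c])
```

Voting on hard labels only helps when the models make different errors, which is consistent with the abstract's observation that different PEFT methods learn in varied ways.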

Paper Structure (14 sections, 1 equation, 2 figures, 3 tables)

Figures (2)

  • Figure 1: The architecture of the Hybrid Method. The trainable/frozen parameters are colored in red/blue.
  • Figure 2: Selected PEFT methods for each layer and their associated weights for PR and SID. The shade intensity of each cell indicates the weight associated with each layer.
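The per-layer weights in Figure 2 can be read through the standard DARTS relaxation, in which each layer holds one learnable architecture parameter per candidate operation, mixes the candidates' outputs with softmax weights during search, and keeps the highest-weighted candidate afterwards. The sketch below illustrates only that general mechanism in plain Python. The candidate names, the `alphas` values, and the helper functions are hypothetical, and the paper's actual search space and training procedure are not described in this summary.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(candidate_outputs, alphas):
    """Softmax-weighted sum of candidate outputs for one layer.

    candidate_outputs: one output vector (list of floats) per candidate.
    alphas: one architecture parameter per candidate (assumed values).
    """
    weights = softmax(alphas)
    dim = len(candidate_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, candidate_outputs))
        for i in range(dim)
    ]


def select_candidate(names, alphas):
    """After search, keep the candidate with the largest weight."""
    weights = softmax(alphas)
    best = max(range(len(names)), key=lambda i: weights[i])
    return names[best], weights[best]


# Hypothetical candidates and architecture parameters for a single layer.
names = ["candidate_a", "candidate_b", "candidate_c"]
alphas = [0.1, 2.0, -1.0]
chosen, weight = select_candidate(names, alphas)
```

Under this reading, each cell's shade in Figure 2 would correspond to a weight like `weight` for the method selected at that layer.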