Unifying Model and Layer Fusion for Speech Foundation Models

Yi-Jen Shih, David Harwath

TL;DR

This work proposes a unified interface that fuses representations jointly across multiple Speech Foundation Models (SFMs) and across each model's internal layers. After matching layer counts and temporal lengths across the upstream models, the interface applies one of two fusion schemes, Hierarchical Convolution (HConv) or Concatenation with projection (CHConv), to fuse $N$ upstream models and their layer representations, with outputs in $\mathbb{R}^{L\times T\times D}$ or $\mathbb{R}^{L\times T\times (N D)}$ (layers $L$, frames $T$, feature dimension $D$). Experiments on ASR (LibriSpeech, ML-SUPERB) and non-ASR tasks (speaker verification, SV, and emotion recognition, ER) show that joint model-and-layer fusion generally outperforms single-model baselines and prior fusion methods, with larger gains when fusing self-supervised with supervised models and when using large SFMs. The results underscore the importance of upstream-model selection, show that the proposed interfaces are robust across languages and tasks, and point to joint distillation as a possible way to reduce inference cost in future work.
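The two output shapes mentioned above can be illustrated with a minimal pure-Python sketch. It is not the paper's HConv or CHConv implementation: the evenly spaced layer selection, the cropping to the shortest sequence, the element-wise mean, and all function names are assumptions chosen only to show how $N$ aligned feature stacks yield an $L\times T\times (N D)$ or $L\times T\times D$ tensor.

```python
from typing import List

# Features of one upstream model, indexed as [layer][time][dim].
Features = List[List[List[float]]]


def pick_layers(feats: Features, num_layers: int) -> Features:
    """Keep num_layers evenly spaced layers (hypothetical matching rule)."""
    n = len(feats)
    if num_layers == 1:
        return [feats[-1]]
    idx = [round(i * (n - 1) / (num_layers - 1)) for i in range(num_layers)]
    return [feats[i] for i in idx]


def crop_time(feats: Features, num_frames: int) -> Features:
    """Crop every layer to num_frames frames (hypothetical matching rule)."""
    return [layer[:num_frames] for layer in feats]


def align(models: List[Features]) -> List[Features]:
    """Give all models the same layer count L and temporal length T."""
    num_layers = min(len(m) for m in models)
    num_frames = min(len(m[0]) for m in models)
    return [crop_time(pick_layers(m, num_layers), num_frames) for m in models]


def fuse_concat(models: List[Features]) -> Features:
    """Concatenate along the feature axis: N x (L, T, D) -> (L, T, N*D)."""
    aligned = align(models)
    L, T = len(aligned[0]), len(aligned[0][0])
    return [[[x for m in aligned for x in m[l][t]] for t in range(T)]
            for l in range(L)]


def fuse_mean(models: List[Features]) -> Features:
    """Stand-in fusion that keeps D: element-wise mean over the N models."""
    aligned = align(models)
    n = len(aligned)
    L, T, D = len(aligned[0]), len(aligned[0][0]), len(aligned[0][0][0])
    return [[[sum(m[l][t][d] for m in aligned) / n for d in range(D)]
             for t in range(T)] for l in range(L)]


def shape(x: Features) -> tuple:
    return (len(x), len(x[0]), len(x[0][0]))


def toy(layers: int, frames: int, dim: int, value: float) -> Features:
    """Constant-valued toy features standing in for real model outputs."""
    return [[[value] * dim for _ in range(frames)] for _ in range(layers)]


model_a = toy(12, 50, 4, 1.0)  # e.g. a 12-layer model, 50 frames
model_b = toy(6, 49, 4, 3.0)   # e.g. a 6-layer model, 49 frames
print(shape(fuse_concat([model_a, model_b])))  # (6, 49, 8)
print(shape(fuse_mean([model_a, model_b])))    # (6, 49, 4)
```

In this toy setup both models share the same feature dimension; models with different dimensions would need a projection before the mean-style fusion, which the sketch leaves out.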

Abstract

Speech Foundation Models have gained significant attention recently. Prior work has shown that fusing representations from multiple layers of the same model, or fusing multiple models, can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments on different self-supervised and supervised models across various speech tasks, including ASR and paralinguistic analysis, and demonstrate that our method outperforms prior fusion approaches. We further analyze its scalability with respect to model size and model count, highlighting the importance of selecting appropriate upstream models. Our results show that the proposed interface provides an additional performance boost when given a suitable selection of upstream models, making it a promising approach for utilizing Speech Foundation Models.

Paper Structure

This paper contains 19 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: An overview framework of Interface with fusion.