Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models

Zhaoqing Li, Haoning Xu, Xurong Xie, Zengrui Jin, Tianzi Wang, Xunying Liu

TL;DR

The paper introduces a memory-efficient, small-to-large network unfolding framework for ASR, enabling a compact seed model to unfold to deeper depths without increasing memory use. It jointly trains multiple unfolding paths and employs self-distillation via KL-based regularization to align seed and deepest unfolded outputs, achieving performance comparable to independently trained larger models. Empirical results on Conformer and wav2vec2/HuBERT backbones show 30-35% parameter reductions with no significant loss in WER, highlighting strong memory and deployment advantages for edge devices. The method offers flexible on-device depth selection, reduces training/storage burden, and outperforms several prior compression techniques in SSL settings.

Abstract

This paper presents a novel memory-efficient model compression approach for Conformer ASR and speech foundation systems. Our approach features a unique "small-to-large" design. A compact "seed" model containing a few Conformer or Transformer blocks is trained and unfolded many times to emulate the performance of larger uncompressed models with different logical depths. The seed model and many unfolded paths are jointly trained within a single unfolding cycle. The KL-divergence between the largest unfolded and smallest seed models is used in a self-distillation process to minimize their performance disparity. Experimental results show that our foldable model produces ASR performance comparable to individually constructed Conformer and wav2vec2/HuBERT speech foundation models under various depth configurations, while requiring only minimal memory and storage. Conformer and wav2vec2 models with parameter reductions of 35% and 30%, respectively, are obtained without loss of performance.
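The self-distillation step described in the abstract can be illustrated with a minimal sketch. It assumes the deepest unfolded path acts as the teacher and the seed model as the student, and that the KL term is added to the sum of per-path training losses with a weight. The direction of the KL term, the weight `kl_weight`, and all function names are illustrative assumptions, not details confirmed by the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_distillation_loss(deep_logits, seed_logits):
    # Assumption: the deepest unfolded path is the teacher, the seed the student.
    teacher = softmax(deep_logits)
    student = softmax(seed_logits)
    return kl_divergence(teacher, student)

def joint_loss(path_losses, deep_logits, seed_logits, kl_weight=1.0):
    # Assumption: per-path losses (seed and unfolded paths) are summed,
    # and the KL regularizer is added with a hypothetical weight.
    kl = self_distillation_loss(deep_logits, seed_logits)
    return sum(path_losses) + kl_weight * kl

if __name__ == "__main__":
    deep = [2.0, 0.5, -1.0]   # toy output logits of the deepest unfolded path
    seed = [1.5, 0.7, -0.8]   # toy output logits of the seed model
    print(joint_loss([1.2, 1.0], deep, seed, kl_weight=0.5))
```

The KL term is zero when the seed and the deepest path produce identical output distributions and grows as they diverge, which is the disparity the abstract says the regularizer is meant to reduce.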

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Diagram of a foldable network. Systems (a), (b), and (c) share the same $N$ physical layers and can be folded or unfolded into one another by changing the number of times the physical layers are repeated, without adding parameters or memory.
  • Figure 2: Averaged WER of Conformer systems versus executed encoder depths (# physical layers + unfolded depths). The foldable and all-physical models are the same ones in Table 1. Solid circles denote models that purely contain physical layers, while hollow circles denote foldable models unfolded from corresponding all-physical systems. Colored areas cover the ranges of standard deviation across different unfolding paths.
  • Figure 3: WER (test-clean) of wav2vec2 versus executed encoder depths. The foldable and compact models are the same ones in Table 2. Solid circles denote systems purely containing physical layers, while hollow circles denote foldable systems unfolded from corresponding all-physical systems.
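The idea behind Figure 1, that repeating the same physical layers changes the executed depth while the parameter count stays fixed, can be sketched with a toy example. `ToyLayer` and `FoldableEncoder` are hypothetical names, the layers are simple scale-and-shift maps rather than Conformer blocks, and cycling through the whole stack on each repeat is an assumption about the unfolding order.

```python
class ToyLayer:
    """Stand-in for a physical encoder block: y = scale * x + shift."""

    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def __call__(self, x):
        return [self.scale * v + self.shift for v in x]

    def num_params(self):
        return 2  # scale and shift


class FoldableEncoder:
    """Holds N physical layers and can execute them a variable number of times."""

    def __init__(self, physical_layers):
        self.layers = list(physical_layers)

    def num_params(self):
        # Parameters are stored once, regardless of how often layers are reused.
        return sum(layer.num_params() for layer in self.layers)

    def forward(self, x, repeats=1):
        # Assumption: one "repeat" is a full pass over all physical layers.
        executed_depth = 0
        for _ in range(repeats):
            for layer in self.layers:
                x = layer(x)
                executed_depth += 1
        return x, executed_depth


if __name__ == "__main__":
    encoder = FoldableEncoder([ToyLayer(0.5, 1.0), ToyLayer(2.0, 0.0)])
    for repeats in (1, 2, 3):
        out, depth = encoder.forward([2.0], repeats=repeats)
        print(repeats, depth, encoder.num_params(), out)
```

In this toy setup the executed depth is `repeats` times the number of physical layers, while `num_params()` returns the same value for every choice of `repeats`.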