STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models

Kangwook Jang; Sungnyun Kim; Hoirin Kim

TL;DR

The paper addresses the difficulty of deploying Transformer-based speech self-supervised learning (SSL) models on resource-limited devices by introducing Speech Temporal Relation (STaR) distillation. Instead of matching each frame's representation directly, STaR transfers the temporal relations between speech frames using two temporal Gram matrices (TGMs), one layer-wise and one intra-layer, without adding extra parameters, which yields a lightweight, task-agnostic student. On the SUPERB benchmark, a student distilled from HuBERT Base reaches an overall score of 79.8 with around 27M parameters, the best result among models of up to that size, and the method also transfers across different teacher models. The student holds up across downstream tasks such as phoneme recognition (PR), automatic speech recognition (ASR), and speaker-related tasks, and stays robust when the parameter count is reduced further.

Abstract

Despite the strong performance of Transformer-based speech self-supervised learning (SSL) models, their large parameter size and computational cost make them impractical to use. In this study, we propose to compress speech SSL models by distilling speech temporal relation (STaR). Unlike previous works that directly match the representation of each speech frame, STaR distillation transfers the temporal relation between speech frames, which is more suitable for a lightweight student with limited capacity. We explore three STaR distillation objectives and select the best combination as the final STaR loss. Our model distilled from HuBERT BASE achieves an overall score of 79.8 on the SUPERB benchmark, the best performance among models with up to 27 million parameters. We show that our method is applicable across different speech SSL models and maintains robust performance with further reduced parameters.
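The contrast between frame-wise matching and relation matching can be illustrated with a minimal sketch. It assumes a temporal Gram matrix is the matrix of channel-wise dot products between every pair of frames, which follows the Figure 1 description of aggregating channel information at two time steps. The scaling by the channel dimension, the mean-squared-error loss, the function names, and the student width of 256 are all illustrative assumptions and are not taken from the paper; the sketch also does not distinguish the layer-wise and intra-layer variants.

```python
import numpy as np


def temporal_gram_matrix(feats: np.ndarray) -> np.ndarray:
    """Map (T, D) frame features to a (T, T) matrix of frame-to-frame relations.

    Entry (i, j) is the dot product over channels of frames i and j,
    divided by D (the scaling is an assumption of this sketch).
    """
    _, num_channels = feats.shape
    return feats @ feats.T / num_channels


def tgm_distill_loss(teacher_feats: np.ndarray, student_feats: np.ndarray) -> float:
    """Mean squared error between teacher and student Gram matrices.

    MSE is an assumed choice; the source text does not state the loss form.
    """
    g_teacher = temporal_gram_matrix(teacher_feats)
    g_student = temporal_gram_matrix(student_feats)
    return float(np.mean((g_teacher - g_student) ** 2))


rng = np.random.default_rng(0)
teacher = rng.standard_normal((50, 768))  # 50 frames, 768 channels (HuBERT Base width)
student = rng.standard_normal((50, 256))  # same 50 frames, hypothetical narrower student

# Both Gram matrices are (50, 50), so they can be compared directly
# even though the channel widths differ.
loss = tgm_distill_loss(teacher, student)
```

Because the Gram matrix has shape (T, T) regardless of the channel width, teacher and student relations are directly comparable without a learned projection between their feature spaces, which is consistent with the claim that the method adds no extra parameters.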

Paper Structure (12 sections, 5 equations, 2 figures, 3 tables)


Introduction
Speech Temporal Relation Distillation
Average Attention Map Distillation
Temporal Gram Matrix Distillation
Results
Experimental Details
Selection of STaR Loss
Detailed SUPERB Benchmark Results
Examination on Universality
Compression for Smaller Parameter Sizes
Conclusion
References

Figures (2)

Figure 1: We propose three STaR distillation objectives: average attention map, layer-wise TGM, and intra-layer TGM. TGM captures the temporal relation by aggregating the channel information at two time steps. Each individual loss is summed up across all Transformer layers.
Figure 2: Performance comparisons of models with fewer parameters.
