Sustainable self-supervised learning for speech representations

Luis Lugo; Valentin Vielzeuf

TL;DR

Self-supervised speech representations deliver strong downstream performance but incur large compute and energy costs, raising sustainability concerns. The paper surveys predictive and contrastive SSL methods, multilingual models, and efficiency-focused strategies across optimization, architecture, fine-tuning, and data, proposing concrete techniques to cut memory and compute while maintaining performance. It highlights approaches such as DistilHuBERT, MelHuBERT, LoRA, READ, FNet, and FlashAttention, illustrating trade-offs between cost and accuracy. The findings suggest that substantially more efficient SSL can approach large-model performance, enabling broader deployment and reproducibility in resource-constrained settings.

Abstract

Sustainable artificial intelligence focuses on data, hardware, and algorithms to make machine learning models more environmentally responsible. In particular, machine learning models for speech representations are computationally expensive, raising environmental concerns because of their high energy consumption. Thus, we propose a sustainable self-supervised model to learn speech representations, combining optimizations in neural layers and training to reduce computing costs. The proposed model improves over a resource-efficient baseline, reducing both memory usage and estimated computing cost. It pretrains on a single GPU in less than a day. In addition, it improves on the baseline's error rates in downstream task evaluations. Compared with large speech representation approaches, it reduces memory usage by an order of magnitude and computing cost by almost three orders of magnitude.

Paper Structure (11 sections, 9 equations, 3 figures, 1 table)

Introduction
Self-supervised learning for speech
Predictive self-supervised approaches
Contrastive self-supervised approaches
Multilingual models
Limitations of self-supervised models
Towards optimization of existing speech models
Towards neural architecture efficiency
Towards finetuning efficiency
Towards data efficiency for model pretraining
Conclusions

Figures (3)

Figure 1: Recent self-supervised approaches for speech representation learning, including contrastive and predictive approaches, and multilingual models.
Figure 2: Existing models for self-supervised speech representation learning. This trend illustrates the trade-off between computational costs and representation performance.
Figure 3: Methods proposing efficiency-oriented approaches that can deal with the computational costs of self-supervised learning architectures, including new architectures, self-attention improvements, and efficient transfer learning.
