Sustainable self-supervised learning for speech representations

Luis Lugo, Valentin Vielzeuf

TL;DR

Self-supervised speech representations deliver strong downstream performance but incur large compute and energy costs, raising sustainability concerns. The paper surveys predictive and contrastive SSL methods, multilingual models, and efficiency-focused strategies across optimization, architecture, fine-tuning, and data, and proposes concrete techniques to cut memory and compute while maintaining performance. It highlights approaches such as DistilHuBERT, MelHuBERT, LoRA, READ, FNet, and FlashAttention, illustrating trade-offs between cost and accuracy. The findings suggest that substantially more efficient SSL can approach large-model performance, enabling broader deployment and reproducibility in resource-constrained settings.

Abstract

Sustainable artificial intelligence focuses on data, hardware, and algorithms to make machine learning models more environmentally responsible. In particular, machine learning models for speech representations are computationally expensive, raising environmental concerns because of their high energy consumption. Thus, we propose a sustainable self-supervised model to learn speech representations, combining optimizations in neural layers and training to reduce computing costs. The proposed model improves over a resource-efficient baseline, reducing both memory usage and estimated computing cost. It pretrains on a single GPU in less than a day. In addition, it improves on the baseline's error rate in downstream task evaluations. Compared to large speech representation approaches, it reduces memory usage by an order of magnitude and computing cost by almost three orders of magnitude.

Paper Structure

This paper contains 11 sections, 9 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Recent self-supervised approaches for speech representation learning, including contrastive and predictive approaches, and multilingual models.
  • Figure 2: Existing models for self-supervised speech representation learning. This trend illustrates the trade-off between computational costs and representation performance.
  • Figure 3: Methods proposing efficiency-oriented approaches that can deal with the computational costs of self-supervised learning architectures, including new architectures, self-attention improvements, and efficient transfer learning.
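
Figure 3 lists efficient transfer learning among the efficiency-oriented methods, and the TL;DR names LoRA as one example. As a rough illustration of why such methods reduce fine-tuning cost, the sketch below compares the number of trainable parameters for one weight matrix under full fine-tuning and under a low-rank update of the form W + BA. The function names, the layer size, and the rank are assumptions chosen for illustration; none of them come from the paper, and the paper's own model may use a different configuration.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update W + B @ A,
    where B has shape (d_out, rank) and A has shape (rank, d_in).
    The original matrix W stays frozen."""
    return rank * (d_out + d_in)


# Hypothetical sizes for illustration only; not values reported in the paper.
d_model = 768
rank = 8

full = full_finetune_params(d_model, d_model)
lora = lora_params(d_model, d_model, rank)

print(f"full fine-tuning: {full} trainable parameters")
print(f"low-rank update:  {lora} trainable parameters")
print(f"reduction factor: {full / lora:.1f}x")
```

Under these assumed sizes, the low-rank update trains far fewer parameters per adapted matrix than full fine-tuning. The actual savings depend on which layers are adapted and on the chosen rank.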