CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing

Yen-Ju Lu, Jing Liu, Thomas Thebaud, Laureano Moro-Velazquez, Ariya Rastrow, Najim Dehak, Jesus Villalba

TL;DR

CA-SSLR is a condition-aware self-supervised speech representation that injects language and speaker context into a frozen SSL encoder through hierarchical time-channel conditioning. The method uses lightweight adapters and Time-Channel Attention Conditioner (TCAC) modules to dynamically modulate representations, enabling a single generalist SSLR to perform multiple tasks with minimal task-specific tuning. Empirical results show improvements in language identification (LID) error, ASR character error rate (CER), and speaker verification (SV) equal error rate (EER) on the ML-SUPERB and VoxCeleb benchmarks, along with favorable parameter efficiency and potential streaming viability. The approach offers a practical path to robust multilingual, multi-speaker speech processing with reduced forgetting and improved generalization to unseen tasks.

Abstract

We introduce Condition-Aware Self-Supervised Learning Representation (CA-SSLR), a generalist conditioning model broadly applicable to various speech-processing tasks. Compared to standard fine-tuning methods that optimize for downstream models, CA-SSLR integrates language and speaker embeddings from earlier layers, making the SSL model aware of the current language and speaker context. This approach reduces the reliance on input audio features while preserving the integrity of the base SSLR. CA-SSLR improves the model's capabilities and demonstrates its generality on unseen tasks with minimal task-specific tuning. Our method employs linear modulation to dynamically adjust internal representations, enabling fine-grained adaptability without significantly altering the original model behavior. Experiments show that CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a 10% relative reduction in LID errors, a 37% improvement in ASR CER on the ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating its effectiveness.
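
As a rough illustration of the "linear modulation" idea mentioned in the abstract, the sketch below applies a per-channel scale and shift, both predicted from a condition embedding, to a sequence of hidden frames. This is a generic feature-wise linear modulation written under our own assumptions, not the paper's TCAC: it has no time-attention component, and the names `film_modulate`, `w_gamma`, and `w_beta` are hypothetical.

```python
from typing import List, Sequence


def film_modulate(
    hidden: Sequence[Sequence[float]],   # T frames, each with C channels
    cond: Sequence[float],               # condition embedding of size D (e.g. language or speaker)
    w_gamma: Sequence[Sequence[float]],  # C x D projection for the scale (hypothetical name)
    w_beta: Sequence[Sequence[float]],   # C x D projection for the shift (hypothetical name)
) -> List[List[float]]:
    """Return gamma(cond) * hidden + beta(cond), applied per channel to every frame."""
    # The scale is parameterized as 1 + W_gamma @ cond, so zero weights give gamma = 1.
    gamma = [1.0 + sum(w * c for w, c in zip(row, cond)) for row in w_gamma]
    # The shift is W_beta @ cond, so zero weights give beta = 0.
    beta = [sum(w * c for w, c in zip(row, cond)) for row in w_beta]
    # The same per-channel scale and shift are broadcast over all time frames.
    return [[g * h + b for h, g, b in zip(frame, gamma, beta)] for frame in hidden]


hidden = [[1.0, 2.0], [3.0, 4.0]]   # 2 frames, 2 channels
cond = [0.5, -1.0, 2.0]             # 3-dimensional condition embedding
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# With zero-initialized projections the layer is an identity mapping.
unchanged = film_modulate(hidden, cond, zeros, zeros)

# Non-zero projections rescale channel 0 and shift channel 1.
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_beta = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
modulated = film_modulate(hidden, cond, w_gamma, w_beta)
```

Initializing the projections at zero is one common way to leave the frozen encoder's output untouched at the start of adaptation, which fits the abstract's goal of not significantly altering the original model behavior. Whether CA-SSLR uses this particular initialization is not stated in the text above.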

Paper Structure

This paper contains 46 sections, 5 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: CA-SSLR scheme and its time-channel attention conditioner. Only the conditioner and linear projections for the decoders are trainable, and all other parameters are frozen during adaptation.
  • Figure 2: Architecture of the CA-SSLR model employing hierarchical self-conditioning with Time-Channel Attention Conditioners (TCACs).
  • Figure 3: CER versus trainable parameters on the XLSR model for normal and few-shot languages, demonstrating the adaptation ability of the TCA conditioner.
  • Figure 4: Ablation study of condition-aware settings for ASR-adapted XLSR models on the 10-min ML-SUPERB dataset, using CC or TCAC. Conditioning is based on predicted language labels or LID embeddings, except in the ground truth (G.T.) experiment.