JOOCI: a Framework for Learning Comprehensive Speech Representations

Hemant Yadav, Rajiv Ratn Shah, Sunayana Sitaram

TL;DR

JOOCI is a framework for learning comprehensive speech representations by jointly optimizing Content and Other information without sacrificing representational-depth. It introduces a split architecture with a shared front-end, a Content encoder for linguistic content, and an Other encoder for speaker and paralinguistic information, coupled with a split-and-append mechanism and separate training losses: Multicluster MPL for content, a teacher-student RDINO-based objective for the Other channel, and a regularizer with a gradient reversal layer. The method achieves a 26.5% improvement over WavLM on four SUPERB benchmark tasks (ASR, PR, SID, and ASV) while using a comparable number of parameters and without task-specific adapters. These results indicate that JOOCI uses the full representational-depth to jointly encode linguistic and paralinguistic cues, with practical implications for robust, multi-task speech systems.
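
The gradient reversal layer mentioned above can be illustrated with a minimal sketch. It assumes the standard formulation: identity in the forward pass, negated and scaled gradient in the backward pass. The class name `GradientReversal`, the `scale` parameter, and the hand-written `forward`/`backward` methods are hypothetical and stand in for an autograd framework. Where the layer sits inside JOOCI's regularizer is not specified here, so this shows only the generic mechanism, not the authors' implementation.

```python
class GradientReversal:
    """Generic gradient reversal layer (illustrative, not JOOCI's code)."""

    def __init__(self, scale=1.0):
        # `scale` is a hypothetical knob controlling the reversal strength.
        self.scale = scale

    def forward(self, x):
        # Forward pass: features pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # Backward pass: the incoming gradient is negated and scaled, so the
        # layers below are pushed to *increase* the loss computed above.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal(scale=0.5)
features = grl.forward([1.0, -2.0])   # identical to the input
grads = grl.backward([2.0, 4.0])      # sign flipped, scaled by 0.5
```

In an autograd framework the same behaviour is usually written as a custom differentiable function, so that the reversal happens automatically during backpropagation.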

Abstract

Information in speech can be categorized into two groups: Content (what is being said, such as linguistics) and Other (how it is expressed, such as information about the speaker and paralinguistic features). Current self-supervised learning (SSL) methods are shown to divide the model's representational-depth, or layers, in two, with earlier layers specializing in Other and later layers in Content-related tasks. This layer-wise division is inherently sub-optimal, as neither information type can use all layers to build hierarchical representations. To address this, we propose JOOCI, a novel speech representation learning method that does not compromise on the representational-depth for either information type. JOOCI outperforms WavLM by 26.5%, and other models of similar size (100M parameters), when evaluated on two speaker recognition and two language tasks from the SUPERB benchmark, demonstrating its effectiveness in Jointly Optimizing Other and Content Information (JOOCI).

Paper Structure

This paper contains 22 sections, 5 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Information in speech can be categorized into Other and Content. A model that claims to learn comprehensive speech representations, capturing both Other and Content information, must excel across downstream tasks, including ASR, PR, ST, SID, SV, and ER.
  • Figure 2: JOOCI Method. As shown, raw audio is processed through the shared Encoder$^S$ ($E^S$), and the output is passed to Encoder$^O$ ($E^O$) and Encoder$^C$ ($E^C$). The split-and-append mechanism enables the Other encoder to extract useful information from the Content encoder during the forward pass, while preventing gradient flow during the backward pass. The overall design ensures that both encoders can operate independently while still benefiting from $E^S$. During inference, $E^O$ is used for tasks requiring Other information and $E^C$ for Content-related tasks.
  • Figure 3: Weight analysis on the SUPERB benchmark. Layer 0 corresponds to the input of the first Transformer layer. The y-axis represents different tasks, while the x-axis represents different layers. The higher the layer weight, the greater its contribution to the weighted sum. For comparison with HuBERT and WavLM, please see appendix Section \ref{section:hubwavsuperb}.
  • Figure 4: Studying the effect of data augmentation on the content encoder using CCA word label similarity. The higher the CCA similarity, and the more layers over which it stays high, the better the method.
  • Figure 5: HuBERT weight analysis on the SUPERB benchmark.
  • ...and 1 more figures
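
The "weighted sum" analysed in Figures 3 and 5 can be sketched as follows. This is a minimal illustration that assumes the usual SUPERB-style setup: each layer gets one learnable scalar, the scalars are softmax-normalized, and the downstream feature is the resulting weighted sum of layer outputs. The function names `softmax` and `weighted_layer_sum` and the toy numbers are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Normalize raw per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, layer_scores):
    """Combine per-layer feature vectors into one vector.

    layer_outputs: one feature vector per layer (layer 0 first).
    layer_scores:  one raw (pre-softmax) scalar per layer.
    """
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layer_outputs))
        for i in range(dim)
    ]


# Toy example: two layers with 2-dimensional features and equal scores,
# so each layer contributes half of the final feature.
features = weighted_layer_sum([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

Under this assumption, a larger normalized weight for a layer means the downstream task draws more of its feature from that layer, which is what the heatmaps in the weight analysis visualize.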