JOOCI: a Framework for Learning Comprehensive Speech Representations

Hemant Yadav; Rajiv Ratn Shah; Sunayana Sitaram

TL;DR

JOOCI is a framework for learning comprehensive speech representations by jointly optimizing Content and Other information without sacrificing representational depth for either. It uses a split architecture with a shared front-end, a Content encoder for linguistic content, and an Other encoder for speaker and paralinguistic information, coupled with a split-and-append mechanism. Each branch has its own training loss: Multicluster MPL for Content and a teacher-student RDINO-based objective for Other, plus a regularizer that uses a gradient reversal layer. On two language tasks (ASR, PR) and two speaker tasks (SID, ASV) from the SUPERB benchmark, JOOCI improves on WavLM by 26.5% with a comparable number of parameters and without task-specific adapters. These results suggest that giving each information type the full representational depth lets a single model encode both linguistic and paralinguistic cues, which matters for robust multi-task speech systems.
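The regularizer above relies on a gradient reversal layer. As a minimal, framework-free sketch of that general technique (not the authors' implementation), the layer acts as the identity in the forward pass and negates the gradient in the backward pass. The class name `GradientReversal` and the `scale` parameter are hypothetical, and this summary does not say which branch the layer is attached to in JOOCI.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in the backward pass."""

    scale: float = 1.0  # hypothetical reversal strength, not taken from the paper

    def forward(self, x: List[float]) -> List[float]:
        # Features pass through unchanged.
        return list(x)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The upstream encoder receives the negated (and scaled) gradient,
        # so it is updated against the objective of the downstream predictor.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal(scale=0.5)
features = [0.2, -1.0, 3.0]
out = grl.forward(features)               # same values as `features`
grad_in = grl.backward([1.0, 2.0, -4.0])  # sign-flipped, scaled by 0.5
```

In an autodiff framework the same behavior would normally be expressed as a custom function with a user-defined backward rule. The manual `forward`/`backward` pair here only makes the sign flip explicit.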

Abstract

Information in speech can be categorized into two groups: Content (what is being said, such as linguistics) and Other (how it is expressed, such as speaker information and paralinguistic features). Current self-supervised learning (SSL) methods have been shown to split the model's representational depth, i.e. its layers, in two, with earlier layers specializing in Other and later layers in Content-related tasks. This layer-wise division is inherently sub-optimal, as neither information type can use all layers to build hierarchical representations. To address this, we propose JOOCI, a novel speech representation learning method that does not compromise on representational depth for either information type. JOOCI outperforms WavLM by 26.5%, and also outperforms other models of similar size (100M parameters), when evaluated on two speaker recognition and two language tasks from the SUPERB benchmark, demonstrating its effectiveness in Jointly Optimizing Other and Content Information (JOOCI).
