Quantifying Dimensional Independence in Speech: An Information-Theoretic Framework for Disentangled Representation Learning

Bipasha Kashyap; Björn W. Schuller; Pubudu N. Pathirana

TL;DR

This work introduces an information-theoretic framework that quantifies cross-dimension statistical dependence in handcrafted acoustic features by integrating bounded neural mutual information (MI) estimation with non-parametric validation, giving a principled way to measure dimensional independence in speech.

Abstract

Speech signals encode emotional, linguistic, and pathological information within a shared acoustic channel; however, disentanglement is typically assessed indirectly through downstream task performance. We introduce an information-theoretic framework to quantify cross-dimension statistical dependence in handcrafted acoustic features by integrating bounded neural mutual information (MI) estimation with non-parametric validation. Across six corpora, cross-dimension MI remains low, with tight estimation bounds ($< 0.15$ nats), indicating weak statistical coupling in the data considered, whereas Source–Filter MI is substantially higher (0.47 nats). Attribution analysis, defined as the proportion of total MI attributable to source versus filter components, reveals source dominance for emotional dimensions (80%) and filter dominance for linguistic and pathological dimensions (60% and 58%, respectively). These findings provide a principled framework for quantifying dimensional independence in speech.
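The attribution analysis described in the abstract can be illustrated with a minimal sketch. It assumes the source and filter shares are simple ratios of two non-negative MI estimates to their sum; the paper's own definition (its Eq. 3) is not reproduced here and may differ. The function name, the clamping of negative estimates to zero, and the input values are hypothetical and are not taken from the paper.

```python
def source_filter_attribution(mi_source: float, mi_filter: float) -> tuple[float, float]:
    """Return (source_share, filter_share) of the total MI, each in [0, 1].

    Assumption: attribution is the ratio of each component's MI to the sum
    of both. Neural MI estimators can return small negative values, so the
    inputs are clamped to zero first (an illustrative choice).
    """
    s = max(mi_source, 0.0)
    f = max(mi_filter, 0.0)
    total = s + f
    if total == 0.0:
        # No shared information in either component: the shares are undefined.
        raise ValueError("total MI is zero; attribution is undefined")
    return s / total, f / total


# Hypothetical MI estimates in nats, chosen only to illustrate the arithmetic.
source_share, filter_share = source_filter_attribution(0.08, 0.02)
print(f"source share: {source_share:.2f}, filter share: {filter_share:.2f}")
```

With these made-up inputs the source component carries about four fifths of the total, which is the kind of split the abstract calls source dominance.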

Paper Structure (25 sections, 14 equations, 4 figures, 1 table)

Introduction
Methodology
Problem Formulation
Feature Extraction
Bounded Neural MI Estimation
MINE with EMA Stabilisation
CLUB with Variance Clamping
KSG Non-Parametric Validation
Combined Estimation with Adaptive Weighting
Training Protocol
Source-Filter Attribution
Experimental Setup
Datasets
Implementation Details
Results
...and 10 more sections

Figures (4)

Figure 1: Proposed framework for quantifying dimensional independence.
Figure 2: Cross-dimension MI (Final) heatmap across dataset combinations. Lower values (lighter) indicate greater independence between feature sets. Top marginal bars show per-pair means averaged across all combinations.
Figure 3: MI estimation convergence across dimensions: (a) Emotion–Linguistic, (b) Emotion–Pathology, (c) Linguistic–Pathology, and (d) Source–Filter. Per-ensemble MINE and CLUB trajectories are shown (offset for visual clarity), alongside the KSG baseline (dashed) and the final consensus estimate for a representative dataset combination (RAVDESS, L2-ARCTIC, UA-Speech). The mean estimator gap ($\Delta$) over the final 10 training epochs is reported, demonstrating progressive reduction and convergence over training.
Figure 4: Source–Filter attribution across semantic dimensions. Stacked bars show the mean proportion of MI carried by Source vs. Filter components (Eq. 3); individual dataset combinations are overlaid as jitter points. Error bars indicate 95% CI.
