Quantifying Dimensional Independence in Speech: An Information-Theoretic Framework for Disentangled Representation Learning

Bipasha Kashyap, Björn W. Schuller, Pubudu N. Pathirana

TL;DR

This work introduces an information-theoretic framework that quantifies cross-dimension statistical dependence in handcrafted acoustic features by combining bounded neural mutual information (MI) estimation with non-parametric validation, giving a principled measure of dimensional independence in speech.

Abstract

Speech signals encode emotional, linguistic, and pathological information within a shared acoustic channel; however, disentanglement is typically assessed indirectly through downstream task performance. We introduce an information-theoretic framework to quantify cross-dimension statistical dependence in handcrafted acoustic features by integrating bounded neural mutual information (MI) estimation with non-parametric validation. Across six corpora, cross-dimension MI remains low, with tight estimation bounds ($< 0.15$ nats), indicating weak statistical coupling in the data considered, whereas Source–Filter MI is substantially higher (0.47 nats). Attribution analysis, defined as the proportion of total MI attributable to source versus filter components, reveals source dominance for emotional dimensions (80%) and filter dominance for linguistic and pathological dimensions (60% and 58%, respectively). These findings provide a principled framework for quantifying dimensional independence in speech.
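
The attribution analysis described in the abstract can be illustrated with a minimal sketch. It assumes the proportion is computed as each component's MI divided by the sum of the source and filter MI; the paper's exact definition (its Eq. 3) is not reproduced here, so this normalisation, the function name `attribution_shares`, and the input values are all hypothetical.

```python
def attribution_shares(mi_source: float, mi_filter: float) -> tuple[float, float]:
    """Return (source_share, filter_share) of the combined MI.

    Assumption: the share is MI_component / (MI_source + MI_filter).
    Inputs are MI estimates in nats between a semantic dimension and
    the source or filter feature set.
    """
    if mi_source < 0 or mi_filter < 0:
        raise ValueError("MI estimates must be non-negative")
    total = mi_source + mi_filter
    if total == 0:
        raise ValueError("attribution is undefined when both MI values are zero")
    return mi_source / total, mi_filter / total


# Hypothetical MI values in nats, chosen for illustration only
# (these are not results from the paper).
source_share, filter_share = attribution_shares(mi_source=0.4, mi_filter=0.1)
print(f"source: {source_share:.0%}, filter: {filter_share:.0%}")
```

Under this assumed normalisation the two shares always sum to one, which is what lets them be drawn as a single stacked bar per semantic dimension.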

Paper Structure (25 sections, 14 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Proposed framework for quantifying dimensional independence.
  • Figure 2: Cross-dimension MI (Final) heatmap across dataset combinations. Lower values (lighter) indicate greater independence between feature sets. Top marginal bars show per-pair means averaged across all combinations.
  • Figure 3: MI estimation convergence across dimensions: (a) Emotion–Linguistic, (b) Emotion–Pathology, (c) Linguistic–Pathology, and (d) Source–Filter. Per-ensemble MINE and CLUB trajectories are shown (offset for visual clarity), alongside the KSG baseline (dashed) and the final consensus estimate for a representative dataset combination (RAVDESS, L2-ARCTIC, UA-Speech). The mean estimator gap ($\Delta$) over the final 10 training epochs is reported, demonstrating progressive reduction and convergence over training.
  • Figure 4: Source–Filter attribution across semantic dimensions. Stacked bars show the mean proportion of MI carried by Source vs. Filter components (Eq. 3); individual dataset combinations are overlaid as jitter points. Error bars indicate 95% CI.
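
The mean estimator gap reported in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes the gap at each epoch to be the absolute difference between the CLUB and MINE estimates, and averages it over the final 10 epochs. The function name `mean_estimator_gap` and the trajectories are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def mean_estimator_gap(
    mine_estimates: Sequence[float],
    club_estimates: Sequence[float],
    last_n: int = 10,
) -> float:
    """Mean |CLUB - MINE| over the final `last_n` epochs.

    Assumption: the gap is the per-epoch absolute difference between
    the two neural MI estimates (in nats).
    """
    if len(mine_estimates) != len(club_estimates):
        raise ValueError("trajectories must have the same number of epochs")
    if last_n <= 0 or len(mine_estimates) < last_n:
        raise ValueError("need at least `last_n` epochs, with last_n > 0")
    gaps = [
        abs(club - mine)
        for mine, club in zip(mine_estimates[-last_n:], club_estimates[-last_n:])
    ]
    return fmean(gaps)


# Synthetic trajectories for illustration only: the two estimates start
# far apart and settle close together in the last 10 epochs.
mine = [0.0] * 5 + [0.4] * 10
club = [1.0] * 5 + [0.5] * 10
print(f"mean gap over final 10 epochs: {mean_estimator_gap(mine, club):.3f} nats")
```

Averaging over a window of late epochs rather than reading off the final epoch alone reduces sensitivity to epoch-to-epoch noise in the neural estimates.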