Hyperbolic Distance-Based Speech Separation

Darius Petermann, Minje Kim

TL;DR

This work redefines single-channel speech separation as a hierarchical task implemented on a hyperbolic manifold, using a two-level structure that first separates sources by distance to the microphone (near vs. far) and then isolates individual speakers within each group. Embeddings from a BLSTM are projected onto the Poincaré ball via $H=\exp^c_0(Z)$ and classified with a Hyperbolic Multinomial Logistic Regression, employing a two-tier hierarchical softmax to produce parent and child masks. Experiments show that hyperbolic learning yields clearer benefits on the more complex two- and three-child hierarchies, particularly for child-level separation, while providing a natural notion of certainty that correlates with acoustic configurations such as speaker density and source-to-microphone geometry. The results highlight the potential of hyperbolic geometry for interpretable, geometry-aware audio tasks and set the stage for uncertainty-guided processing in real-world meeting and conferencing scenarios. Code will be released to support reproducibility and further exploration.
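
To make the projection $H=\exp^c_0(Z)$ concrete, the following is a minimal NumPy sketch of the exponential map at the origin of the Poincaré ball, using the standard closed form $\exp^c_0(z)=\tanh(\sqrt{c}\,\|z\|)\,z/(\sqrt{c}\,\|z\|)$. It is not the authors' implementation: the function name `expmap0`, the curvature value `c=1.0`, the `eps` guard, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import numpy as np

def expmap0(z, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincaré ball (curvature -c).

    Maps Euclidean vectors z (last axis = embedding dimension) to points
    strictly inside the ball of radius 1/sqrt(c).
    """
    z = np.asarray(z, dtype=np.float64)
    sqrt_c = np.sqrt(c)
    # Guard against division by zero for all-zero embeddings (assumed eps).
    norm = np.maximum(np.linalg.norm(z, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * z / (sqrt_c * norm)

# Toy 2-D embeddings (hypothetical values, not taken from the paper).
Z = np.array([[0.1, -0.2],
              [3.0, 4.0],
              [0.0, 0.0]])
H = expmap0(Z, c=1.0)
```

The map preserves the direction of each embedding and squashes its norm through `tanh`, so large Euclidean embeddings land close to the boundary of the ball while small ones stay near the origin.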

Abstract

In this work, we explore the task of hierarchical distance-based speech separation defined on a hyperbolic manifold. Based on the recent advent of audio-related tasks performed in non-Euclidean spaces, we propose to make use of the Poincaré ball to effectively unveil the inherent hierarchical structure found in complex speaker mixtures. We design two sets of experiments in which the distance-based parent sound classes, namely "near" and "far", can contain up to two or three speakers (i.e., children) each. We show that our hyperbolic approach is suitable for unveiling hierarchical structure from the problem definition, resulting in improved child-level separation. We further show that a clear correlation emerges between the notion of hyperbolic certainty (i.e., the distance to the ball's origin) and acoustic semantics such as speaker density, inter-source location, and microphone-to-speaker distance.
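
The notion of hyperbolic certainty as distance to the ball's origin can be illustrated with a short sketch. It computes two quantities for points already on the Poincaré ball: the plain $L_2$ norm, and the geodesic distance to the origin, $d_c(0,x)=\tfrac{2}{\sqrt{c}}\operatorname{artanh}(\sqrt{c}\,\|x\|)$. Which of the two the authors use in a given analysis is not specified here, so both are shown; the function names, the curvature `c=1.0`, and the sample points are assumptions.

```python
import numpy as np

def certainty_l2(h):
    """Euclidean (L2) norm of each point on the ball."""
    return np.linalg.norm(h, axis=-1)

def dist_to_origin(h, c=1.0):
    """Geodesic distance from the origin on the Poincaré ball (curvature -c)."""
    sqrt_c = np.sqrt(c)
    # Clip just below 1 so artanh stays finite near the boundary (assumed guard).
    r = np.clip(sqrt_c * np.linalg.norm(h, axis=-1), 0.0, 1.0 - 1e-7)
    return 2.0 / sqrt_c * np.arctanh(r)

# Hypothetical embeddings: near the origin, mid-ball, and near the boundary.
H = np.array([[0.05, 0.0],
              [0.6, 0.0],
              [0.0, -0.95]])
l2 = certainty_l2(H)
d = dist_to_origin(H)
```

Both measures grow monotonically as a point moves from the origin toward the boundary, but the geodesic distance diverges near the edge of the ball, which stretches out differences among highly certain embeddings.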

Paper Structure (11 sections, 3 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Illustration of distance-based source separation performed on the Poincaré ball. Top-left is a projection of a three-speaker mixture, where two speakers (green and blue) belong to the near field ($\leq \tau = 0.8$ meters) while one (red) belongs to the far field ($> \tau$). Top-right shows the room configuration of the same mixture. Note that a source at the near-far field boundary tends to be projected near the center of the Poincaré ball, reflecting the model's uncertainty. The bottom plots show projections of two speakers that are progressively placed closer to $\tau$ (thus no hierarchy).
  • Figure 2: Distributions of the $L_2$ norms (i.e., distances to the origin of the Poincaré ball) of all embeddings in their respective test sets, as a function of several acoustic factors: speaker configuration (left), source distance (center), and microphone distance (right).