A Benchmark for Multi-speaker Anonymization

Xiaoxiao Miao, Ruijie Tao, Chang Zeng, Xin Wang

TL;DR

This work addresses a gap in privacy-preserving speech research, which has largely ignored multi-speaker conversations, by establishing a benchmark for multi-speaker anonymization (MSA). It proposes a cascaded system that combines speaker diarization with disentanglement-based anonymization, and introduces two conversation-level anonymizers: differential similarity (DS), which maintains the original relationships between speakers within a conversation, and aggregated similarity (AS), which improves the distinctiveness of the anonymized speakers. The authors evaluate on simulated and real datasets, using the false acceptance rate (FAR) for privacy and the word error rate (WER), predicted mean opinion score (PMOS), and diarization error rate (DER) for utility. They also analyze privacy leakage in overlapping speech and offer lightweight mitigation ideas. The results show that the proposed DS and AS strategies improve speaker distinctiveness and privacy protection compared with single-speaker anonymization (SSA) baselines, providing a practical baseline and guidance for deploying privacy-preserving multi-speaker analytics in real-world settings.

Abstract

Privacy-preserving voice protection approaches primarily suppress privacy-related information derived from paralinguistic attributes while preserving the linguistic content. Existing solutions focus mainly on single-speaker scenarios, which limits their practicality for real-world applications that involve multiple speakers. In this paper, we present an initial attempt to provide a multi-speaker anonymization benchmark by defining the task and evaluation protocol, proposing benchmarking solutions, and discussing the privacy leakage of overlapping conversations. The proposed benchmark solutions are based on a cascaded system that integrates spectral-clustering-based speaker diarization and disentanglement-based speaker anonymization using a selection-based anonymizer. To improve utility, the benchmark solutions are further enhanced by two conversation-level speaker vector anonymization methods. The first method minimizes the differential similarity across speaker pairs in the original and anonymized conversations, which maintains the original speaker relationships in the anonymized version. The other minimizes the aggregated similarity across anonymized speakers, which achieves better differentiation between speakers. Experiments conducted on both non-overlapping simulated and real-world datasets demonstrate the effectiveness of the multi-speaker anonymization system with the proposed speaker anonymizers. Additionally, we analyze overlapping speech with respect to privacy leakage and provide potential solutions.
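
To make the two conversation-level objectives in the abstract concrete, the following is a minimal sketch rather than the paper's actual procedure. It assumes cosine similarity between speaker vectors, an absolute-difference form for the differential similarity, and an exhaustive search over a small candidate set that has already been filtered for privacy. The function names (cosine_matrix, differential_similarity, aggregated_similarity, pick_candidates) are hypothetical and do not come from the paper.

```python
import itertools
import numpy as np


def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


def differential_similarity(originals, anonymized):
    """Sum over speaker pairs of |sim_original - sim_anonymized| (assumed form)."""
    diff = np.abs(cosine_matrix(originals) - cosine_matrix(anonymized))
    upper = np.triu_indices(len(originals), k=1)  # each pair counted once
    return float(diff[upper].sum())


def aggregated_similarity(anonymized):
    """Sum of similarities over all pairs of anonymized speakers."""
    sim = cosine_matrix(anonymized)
    upper = np.triu_indices(len(anonymized), k=1)
    return float(sim[upper].sum())


def pick_candidates(originals, candidates, objective="ds"):
    """Brute-force search for one distinct candidate per original speaker.

    Assumption: `candidates` is small, so trying every ordered selection is
    affordable. A real system would need a cheaper search strategy.
    """
    n = len(originals)
    best_cost, best_idx = float("inf"), None
    for idx in itertools.permutations(range(len(candidates)), n):
        chosen = candidates[list(idx)]
        if objective == "ds":
            cost = differential_similarity(originals, chosen)
        else:
            cost = aggregated_similarity(chosen)
        if cost < best_cost:
            best_cost, best_idx = cost, idx
    return best_idx, best_cost
```

With objective="ds" the search favors anonymized vectors whose pairwise similarities mirror those of the original speakers, while objective="as" favors anonymized vectors that are as dissimilar from one another as possible.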

Paper Structure (43 sections, 5 equations, 9 figures, 11 tables, 1 algorithm)

Figures (9)

  • Figure 1: Privacy and utility evaluation for MSA. The FAR metric assesses privacy, and the WER, PMOS, and DER metrics assess utility. For FAR computation, "O-O pos" denotes trials in which the enrollment and test segments are from the same original speaker, while "O-O neg" denotes trials in which they are from different original speakers. "O-A" denotes trials in which the enrollment segment is from the original speaker and the test segment is the corresponding anonymized segment. FAR is the ratio of the area outlined in black to the yellow area.
  • Figure 2: Pipeline of cascaded MSA, where the speaker diarization (SD) module is first used to aggregate single-speaker segments, followed by disentanglement-based anonymization of each speaker individually.
  • Figure 3: Workflow of the selection-based speaker anonymizer using an external speaker vector pool, as adopted by the VPC baseline systems. The input speaker vector $\boldsymbol{x}_\text{o}^{n}$ is anonymized by selecting the $K$ farthest vectors in the pool $\mathcal{Y}_\text{a} = \{\boldsymbol{y}_\text{a}^1, \ldots, \boldsymbol{y}_\text{a}^P\}$. The anonymized output $\boldsymbol{x}_\text{a}^{n}$ is the average of these $K$ vectors.
  • Figure 4: Illustration of proposed differential and aggregated similarity-based anonymized speaker vector selection methods for $N=3$ speakers. Differential similarity constraints (middle) maintain original relationships (left), while aggregated similarity constraints (right) maximize speaker differentiation.
  • Figure 5: Sum of similarities over all speaker pairs per conversation, where each pair consists of two different speakers, for the original data, $A_{AS}$, and $A_{DS}$, using predicted RTTM on clean simulation datasets.
  • ...and 4 more figures
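
The selection-based anonymizer described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the VPC baseline implementation: it measures distance with cosine similarity (the lowest similarity counts as farthest), and the function name select_farthest_average is hypothetical.

```python
import numpy as np


def select_farthest_average(x_o, pool, k):
    """Anonymize one speaker vector by averaging its k farthest pool vectors.

    x_o  : original speaker vector, shape (D,)
    pool : external speaker vector pool, shape (P, D)
    k    : number of farthest pool vectors to average

    Assumption: "farthest" means lowest cosine similarity to x_o.
    """
    x_unit = x_o / np.linalg.norm(x_o)
    pool_unit = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    cos_sim = pool_unit @ x_unit          # similarity of each pool vector to x_o
    farthest = np.argsort(cos_sim)[:k]    # indices of the k least similar vectors
    x_a = pool[farthest].mean(axis=0)     # anonymized vector is their average
    return x_a, farthest
```

Because each speaker is anonymized independently here, nothing in this step controls how similar the anonymized speakers of one conversation are to each other, which is the gap the conversation-level constraints illustrated in Figure 4 are meant to address.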