Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay

Arash Shahmansoori, Utz Roedig

TL;DR

Contrastive training is applied to learn the underlying speaker-equivariance inductive bias, and new methods are proposed for dynamic registration of speakers using a portion of old utterances, as well as for their removal and re-registration.

Abstract

Voice assistants overhear conversations, so a consent management mechanism is required. Consent management can be implemented using speaker recognition: users who do not give consent enrol their voice, and all their further recordings are discarded. Building speaker-recognition-based consent management is challenging because dynamic registration, removal, and re-registration of speakers must be handled efficiently. This work proposes a consent management system that addresses these challenges. Contrastive training is applied to learn the underlying speaker-equivariance inductive bias. The contrastive features for buckets of speakers are trained for a few steps in each iteration and act as replay buffers. These features are progressively selected for classification using a multi-strided random sampler. Moreover, new methods are proposed for dynamic registration using a portion of old utterances, and for removal and re-registration of speakers. The results verify the memory efficiency and dynamic capabilities of the proposed methods, which outperform the existing approach from the literature.

Paper Structure

This paper contains 15 sections, 9 equations, 8 figures, 3 tables, and 8 algorithms.

Figures (8)

  • Figure 1: The process for the proposed training with contrastive embedding replay for an agent.
  • Figure 2: Pictorial viewpoint of the proposed method in the inference mode for a given agent.
  • Figure 3: The comparison between testing accuracies and losses of an agent using the proposed contrastive embedding replay, with multi-strided progressive sampling in supervised and unsupervised modes, and the baseline method from the literature with respect to the elapsed time for training.
  • Figure 4: The comparison between testing accuracies of different agents using the proposed contrastive embedding replay, with multi-strided progressive sampling in supervised and unsupervised modes, and the baseline method from the literature with respect to the elapsed time for training.
  • Figure 5: The testing accuracies per round for dynamic (top) supervised and (bottom) unsupervised registrations with respect to the elapsed time required to break the registration loop. Different markers and colors distinguish the rounds of dynamic registration. The corresponding values obtained by re-training the network in each round are shown with different markers.
  • ...and 3 more figures