Modeling Overlapped Speech with Shuffles

Matthew Wiesner; Samuele Cornell; Alexander Polok; Lucas Ondel Yang; Lukáš Burget; Sanjeev Khudanpur

Abstract

We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
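
As a minimal sketch of what "all possible serializations" means for two overlapping token sequences, the plain-Python snippet below enumerates the shuffle product of two short sequences of (token, speaker) tuples. It is an illustration only and does not use k2 / Icefall; the function name `shuffles`, the example tokens, and the speaker indices 1 and 2 are assumptions made for this example, not the paper's implementation.

```python
from math import comb


def shuffles(u, v):
    """Yield every interleaving of u and v that preserves the internal
    order of each sequence (the shuffle product of two sequences)."""
    if not u or not v:
        # One sequence is exhausted: only one way to finish.
        yield tuple(u) + tuple(v)
        return
    # Either the next symbol comes from u ...
    for rest in shuffles(u[1:], v):
        yield (u[0],) + rest
    # ... or it comes from v.
    for rest in shuffles(u, v[1:]):
        yield (v[0],) + rest


# Hypothetical example: tag each token with the index of its speaker,
# mirroring the (token, speaker) tuples mentioned in the abstract.
spk1 = [(t, 1) for t in "abc"]
spk2 = [(t, 2) for t in "xy"]

serializations = list(shuffles(spk1, spk2))

# The number of interleavings of sequences of length m and n is C(m + n, m).
assert len(serializations) == comb(len(spk1) + len(spk2), len(spk1))
```

The count grows combinatorially with sequence length, which is why explicit enumeration is only viable for toy inputs and a compact graph representation is needed in practice.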

Paper Structure (30 sections, 17 equations, 2 figures, 5 tables)

Introduction
Related Work
Method
Shuffles for Multi-talker Speech
Partial Orders and Serialization
Relationship to existing serialization schemes
CTC with Shuffles
Extending CTC to Utterance Groups with Shuffles
Speaker Attribution with Shuffles
Multi-talker Decoding
1-pass Decoding
N-pass Decoding
Relationship to Speaker Distinguishable CTC
Aligning Overlapped Speech
Metrics
...and 15 more sections

Figures (2)

Figure 1: FSAs representing different utterance group serializations. Tuples on states represent indices into the two serialized sequences. Colors represent unique sources. (a) The shuffle product FSA between two sequences $y_1 = abc$, $y_2 = {\color{red}xy}$. (b) The pruned shuffle FSA obtained by applying partial order constraints (see the section "Partial Orders and Serialization"). (c) The subgraph corresponding to utterance-level serialized output training (SOT). (d) The subgraph corresponding to token-level SOT, assuming token order $a \leq {\color{red}x} \leq b \leq {\color{red}y} \leq c$.
Figure 2: The compact selfless label topology from Laptev et al. (2022). Unlike the traditional CTC topology, it forces each token to appear on exactly one frame and forces $\oslash$ between every pair of consecutive tokens in the transcript, e.g., between "a" and a subsequent "b".
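
The state labeling described for Figure 1(a) can be sketched in plain Python: each state is a tuple (i, j) recording how many symbols of each sequence have been consumed, and each arc consumes the next symbol of one sequence. This is a hedged illustration rather than the k2 construction used in the paper; the helper names `shuffle_fsa`, `count_paths`, and `accepts` are hypothetical, and partial order pruning is not modeled.

```python
def shuffle_fsa(y1, y2):
    """Build the shuffle product of two sequences as a grid-shaped FSA.

    Returns (states, arcs). A state (i, j) means i symbols of y1 and
    j symbols of y2 have been consumed. Each arc is a tuple
    (src, dst, label, source_id), with source_id 1 or 2.
    """
    states = [(i, j) for i in range(len(y1) + 1) for j in range(len(y2) + 1)]
    arcs = []
    for i, j in states:
        if i < len(y1):  # consume the next symbol of y1
            arcs.append(((i, j), (i + 1, j), y1[i], 1))
        if j < len(y2):  # consume the next symbol of y2
            arcs.append(((i, j), (i, j + 1), y2[j], 2))
    return states, arcs


def count_paths(y1, y2):
    """Count start-to-final paths, i.e. distinct serializations."""
    states, arcs = shuffle_fsa(y1, y2)
    n_paths = {s: 0 for s in states}
    n_paths[(0, 0)] = 1
    # Arcs are grouped by source state in lexicographic order, which is a
    # topological order for this grid, so one forward sweep suffices.
    for src, dst, _, _ in arcs:
        n_paths[dst] += n_paths[src]
    return n_paths[(len(y1), len(y2))]


def accepts(y1, y2, seq):
    """Check whether seq is a valid interleaving of y1 and y2."""
    _, arcs = shuffle_fsa(y1, y2)
    current = {(0, 0)}
    for sym in seq:
        current = {dst for src, dst, label, _ in arcs
                   if src in current and label == sym}
    return (len(y1), len(y2)) in current


states, arcs = shuffle_fsa("abc", "xy")
n_serializations = count_paths("abc", "xy")
# The token order assumed in Figure 1(d) is one path through this graph.
token_sot_ok = accepts("abc", "xy", "axbyc")
```

For sequences of length m and n the graph has (m + 1)(n + 1) states, so it stays compact even though the number of paths through it grows combinatorially.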
