Modeling Overlapped Speech with Shuffles

Matthew Wiesner, Samuele Cornell, Alexander Polok, Lucas Ondel Yang, Lukáš Burget, Sanjeev Khudanpur

Abstract

We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.

Paper Structure (30 sections, 17 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: FSAs representing different utterance group serializations. Tuples on states represent indices into the two serialized sequences. Colors represent unique sources. (a) The shuffle product FSA between two sequences $y_1 = abc$ and $y_2 = {\color{red}xy}$. (b) The pruned shuffle FSA obtained by applying partial order constraints (see the section on partial orders). (c) The subgraph corresponding to utterance-level serialized output training (SOT). (d) The subgraph corresponding to token-level SOT, assuming token order $a \leq {\color{red}x} \leq b \leq {\color{red}y} \leq c$.
  • Figure 2: The compact selfless label topology from Laptev et al. (2022). Unlike the traditional CTC topology, it forces each token to appear on exactly one frame and forces $\oslash$ between every pair of consecutive tokens in the transcript, e.g., between "a" and a subsequent "b".
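
To make the shuffle product in Figure 1(a) concrete, the following is a minimal sketch in plain Python rather than the k2 / Icefall implementation the paper uses. The function names (`shuffle_product_fsa`, `count_paths`, `serializations`) and the tuple-based arc representation are illustrative assumptions, not the authors' code. States are index pairs (i, j) into the two sequences, and each arc carries a (token, speaker) label.

```python
from itertools import product
from math import comb


def shuffle_product_fsa(y1, y2):
    """Build the shuffle product of two token sequences (hypothetical helper).

    A state (i, j) means i tokens of y1 and j tokens of y2 have been emitted.
    Each arc is (src, dst, (token, speaker)).
    """
    states = list(product(range(len(y1) + 1), range(len(y2) + 1)))
    arcs = []
    for i, j in states:
        if i < len(y1):  # emit the next token of speaker 1
            arcs.append(((i, j), (i + 1, j), (y1[i], 1)))
        if j < len(y2):  # emit the next token of speaker 2
            arcs.append(((i, j), (i, j + 1), (y2[j], 2)))
    return states, arcs, (0, 0), (len(y1), len(y2))


def count_paths(arcs, start, final):
    """Count start-to-final paths, i.e. the number of serializations."""
    paths = {start: 1}
    # Lexicographic order on source states is a topological order here,
    # because every arc increases exactly one of the two indices.
    for src, dst, _ in sorted(arcs, key=lambda arc: arc[0]):
        paths[dst] = paths.get(dst, 0) + paths.get(src, 0)
    return paths.get(final, 0)


def serializations(y1, y2):
    """Enumerate every interleaving accepted by the shuffle product FSA."""
    if not y1:
        return [y2]
    if not y2:
        return [y1]
    return ([y1[0] + rest for rest in serializations(y1[1:], y2)]
            + [y2[0] + rest for rest in serializations(y1, y2[1:])])


states, arcs, start, final = shuffle_product_fsa("abc", "xy")
n_paths = count_paths(arcs, start, final)
all_serializations = serializations("abc", "xy")
```

For $y_1 = abc$ and $y_2 = xy$, the graph has $(3+1)(2+1) = 12$ states, and the number of paths equals the binomial coefficient $\binom{5}{2} = 10$. Among these paths are the utterance-level SOT serialization `abcxy` of panel (c) and the token-level SOT serialization `axbyc` of panel (d). The path count grows combinatorially with sequence length, which is the graph-size problem the partial order constraints are meant to address.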