Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition

Hao Shi, Yuan Gao, Zhaoheng Ni, Tatsuya Kawahara

TL;DR

This paper tackles multi-speaker ASR under overlapped speech by enhancing serialized output training (SOT) with two strategies. EncSep introduces a separator after the encoder to produce single-speaker encodings and uses a CTC-Attention hybrid loss to refine encoder representations, while keeping decoding cost unchanged. GEncSep further exploits separated encodings by concatenating them and guiding decoding with attention, yielding additional gains. On LibriMix, the proposed methods deliver notable improvements, especially in noisy and multi-speaker conditions, demonstrating practical potential for robust end-to-end multi-speaker ASR without heavy front-end separation.

Abstract

Serialized output training (SOT) is attracting increasing attention for multi-speaker automatic speech recognition (ASR) because of its convenience and flexibility. However, SOT models are difficult to train with the attention loss alone. In this paper, we propose overlapped encoding separation (EncSep) to fully exploit the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. An additional separator is inserted after the encoder to extract the multi-speaker information with CTC losses. Furthermore, we propose serialized speech information guidance SOT (GEncSep) to make further use of the separated encodings. The separated streams are concatenated to provide single-speaker information that guides attention during decoding. Experimental results on LibriMix show that single-speaker encodings can be separated from the overlapped encoding, that the CTC loss helps improve the encoder representation under complex scenarios, and that GEncSep further improves performance.
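
The CTC and attention hybrid loss mentioned above can be illustrated with a minimal sketch. It assumes the common interpolation form, in which a weight between 0 and 1 balances a CTC term against the attention loss, and it assumes that the CTC losses of the separated streams are averaged. The function name `hybrid_loss`, the default weight of 0.3, and the averaging are hypothetical choices for illustration; the paper's own equations define the exact formulation.

```python
from typing import Sequence


def hybrid_loss(att_loss: float,
                ctc_losses: Sequence[float],
                ctc_weight: float = 0.3) -> float:
    """Interpolate an attention loss with per-stream CTC losses.

    att_loss:   attention (decoder) loss on the serialized transcript.
    ctc_losses: one CTC loss per separated single-speaker stream.
    ctc_weight: interpolation weight in [0, 1] (assumed value, not from the paper).
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    if not ctc_losses:
        raise ValueError("at least one CTC loss is required")
    # Assumption: the per-stream CTC losses are averaged into one term.
    ctc_term = sum(ctc_losses) / len(ctc_losses)
    return ctc_weight * ctc_term + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    # Two separated streams, e.g. a two-speaker mixture.
    print(hybrid_loss(att_loss=1.0, ctc_losses=[2.0, 4.0], ctc_weight=0.3))
```

Setting `ctc_weight` to 0 recovers attention-only training, which is the setting the abstract describes as hard to train.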

Paper Structure (12 sections, 11 equations, 1 figure, 3 tables)