Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
Yiru Zhang, Hang Su, Lichun Fan, Zhenbo Luo, Jian Luan
TL;DR
The paper addresses target-speaker ASR (TS-ASR) in cocktail-party settings and proposes a reasoning-guided framework that integrates Chain-of-Thought (CoT) and Reinforcement Learning (RL) into Large Audio-Language Models (LALMs) to improve transcription accuracy. It introduces a three-stage training pipeline (BASE SFT, CoT fine-tuning, RL refinement) within an LALM-based TS-ASR architecture that uses a Data2Vec2 encoder and an audio prompt formed by concatenating a target reference with the mixed speech. A novel CoT data construction method creates structured reasoning templates with discrete similarity levels, while RL uses Group Relative Policy Optimization (GRPO) with a joint WER and CoT-format reward to sharpen reasoning and transcription quality. Experimental results on Libri2Mix and Libri3Mix show state-of-the-art WER reductions, with further gains in single-speaker TS-ASR, demonstrating the value of explicit reasoning and targeted RL in complex auditory scenes.
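The RL stage described above can be illustrated with a minimal sketch of a joint WER and CoT-format reward, followed by the group-relative advantage normalization that GRPO applies to a group of sampled outputs. This is not the authors' implementation: the `<think>`/`<answer>` output layout, the `format_weight` value, and the rule that malformed outputs receive zero reward are all assumptions made for illustration, and the function names are hypothetical.

```python
import re
from statistics import mean, pstdev


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Assumed output layout (hypothetical): reasoning, then the transcript.
COT_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def joint_reward(output: str, reference: str, format_weight: float = 0.5) -> float:
    """Accuracy reward (1 - WER, floored at 0) plus a format bonus.

    Assumption: outputs that do not follow the layout get zero reward.
    """
    match = COT_PATTERN.match(output.strip())
    if match is None:
        return 0.0
    transcript = match.group(2).strip()
    wer_reward = max(0.0, 1.0 - word_error_rate(reference, transcript))
    return wer_reward + format_weight


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, several candidate outputs would be sampled for the same audio prompt, scored with `joint_reward`, and passed through `group_relative_advantages`, so that an output is reinforced only to the extent that it beats the other samples in its group.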
Abstract
Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advances in Large Audio-Language Models (LALMs) have brought new insights to TS-ASR. However, significant room for optimization remains for the TS-ASR task within the LALM architecture. Chain-of-Thought (CoT) and Reinforcement Learning (RL) have proven effective in certain speech tasks, and TS-ASR, which requires the model to deeply comprehend speech signals, differentiate between speakers, and handle overlapping utterances, is particularly well suited to a reasoning-guided approach. Therefore, we propose a novel framework that incorporates CoT and RL training into TS-ASR to improve performance. A novel CoT dataset for TS-ASR is constructed, and the TS-ASR model is first trained on regular data and then fine-tuned on CoT data. Finally, the model is further trained with RL on selected data to enhance its generalized reasoning capabilities. Experimental results show a significant improvement in TS-ASR performance with CoT and RL training, demonstrating the effectiveness of the proposed training methods for the TS-ASR task.
