Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf

TL;DR

The paper examines whether SpeechLLMs actually use speaker identity in spoken dialogue. It introduces a formal framework that distinguishes identity-critical questions (ICQs) from context-based questions (CBQs), along with automatic labeling to separate the two. In experiments on Gaokao and a controlled "What Do You Like?" dataset, it finds that current SpeechLLMs and ASR+LLM baselines rely largely on transcript content and show limited speaker differentiation, so identity-critical questions remain challenging. The results point to a need for evaluation paradigms and training objectives that explicitly target speaker identification, motivating future work on speaker-aware spoken dialogue question-answering (SQA) benchmarks and model pretraining strategies. Overall, the work calls for stronger speaker-aware SQA capabilities to narrow the gap between human and machine understanding of spoken dialogue.

Abstract

In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content and voice characteristics of speakers in a conversation. However, after carefully examining Gaokao's questions, we find the correct answers to many questions can be inferred from the conversation transcript alone, i.e., without speaker segmentation and identification. Our evaluation of state-of-the-art models Qwen-Audio and WavLLM on both Gaokao and our proposed "What Do You Like?" dataset shows a significantly higher accuracy in these context-based questions than in identity-critical questions, which can only be answered reliably with correct speaker identification. The results and analysis suggest that when solving SQA, the current SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound. We propose that tasks focused on identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA.

Paper Structure (18 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Examples of an identity-critical question (ICQ) and two context-based questions (CBQs). The male (M) and female (F) indicators in the transcript are only included in oracle experiments.
  • Figure 2: Example generation of a question and different answer options for What Do You Like?. The correct answer for each condition (C1:C4) is indicated by an asterisk.
  • Figure 3: Number of times models choose each option for Condition 1 (left) and Condition 2 (right).
  • Figure 4: Number of times models choose each option for Condition 3 (left) and Condition 4 (right).