An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue

Koji Inoue, Divesh Lala, Mikey Elmers, Keiko Ochi, Tatsuya Kawahara

TL;DR

This work addresses addressee recognition in multi-modal, multi-party dialogues by introducing the TEIDAN triadic corpus and a dedicated LLM benchmark. It evaluates GPT-4o on addressee recognition and next-speaker prediction, including an attempt to augment the input with gaze features. Results show the model performs only marginally above chance for addressee recognition and below chance for next-speaker prediction, even when gaze cues are included. The findings reveal a substantial gap between current LLM capabilities and the demands of multi-party conversational dynamics, highlighting the need for advanced multimodal integration and new modeling approaches.

Abstract

Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task's complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.

Paper Structure

This paper contains 9 sections, 1 figure, 6 tables.

Figures (1)

  • Figure 1: A snapshot from the TEIDAN corpus