Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations

Nicolò Penzo, Maryam Sajedinia, Bruno Lepri, Sara Tonelli, Marco Guerini

TL;DR

Results show that response selection relies more on the textual content of conversations, while addressee recognition requires capturing their structural dimension; they also highlight that sensitivity to prompt variations is task-dependent.

Abstract

Assessing the performance of systems to classify Multi-Party Conversations (MPC) is challenging due to the interconnection between linguistic and structural characteristics of conversations. Conventional evaluation methods often overlook variances in model behavior across different levels of structural complexity on interaction graphs. In this work, we propose a methodological pipeline to investigate model performance across specific structural attributes of conversations. As a proof of concept we focus on Response Selection and Addressee Recognition tasks, to diagnose model weaknesses. To this end, we extract representative diagnostic subdatasets with a fixed number of users and a good structural variety from a large and open corpus of online MPCs. We further frame our work in terms of data minimization, avoiding the use of original usernames to preserve privacy, and propose alternatives to using original text messages. Results show that response selection relies more on the textual content of conversations, while addressee recognition requires capturing their structural dimension. Using an LLM in a zero-shot setting, we further highlight how sensitivity to prompt variations is task-dependent.

Paper Structure (23 sections, 2 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: A graphical representation of the experiments. Each turn in a conversation includes a speaker, an addressee and a textual message. From the conversation, we extract the interaction graph to diagnose model capabilities by performing two tasks: addressee recognition and response selection.
  • Figure 2: Example of the four possible conversation representations: (i) Conversation Transcript (top left), (ii) Interaction Transcript (top right), (iii) Summary (bottom left), and (iv) User Description (bottom right).
  • Figure 3: Example of the beginning of the system prompt in the three prompt schemes, from the most verbose (top) to the most concise (bottom).
  • Figure 4: Schematic representation of our evaluation pipeline: on the left, the pipeline and the relation among the elements; on the right, the type of diagnostic evaluation we can perform.
  • Figure 5: AR and RS macro-accuracy results ($y$ axis), for each combination and for each dataset. The height of the columns represents the best macro result across the three prompt schemes. Note that for AR the number of classes on each Ubuntu subset changes, ranging from four (Ubuntu3) to seven (Ubuntu6), since the set of possible addressees includes the speakers involved in each conversation, plus the dummy label. For this reason, results across different Ubuntu subsets on AR should not be compared, and the lowest accuracy is achieved on Ubuntu6.
  • ...and 11 more figures
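
The turn structure from the Figure 1 caption (speaker, addressee, message) and the addressee label set from the Figure 5 caption (the speakers in the conversation plus a dummy label) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Turn` class, the function names, the `DUMMY` label string, and the example messages are all hypothetical, and the graph is simplified here to a count of directed speaker-to-addressee edges.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical name for the dummy addressee label (assumption, not from the paper).
DUMMY = "[NONE]"


@dataclass(frozen=True)
class Turn:
    speaker: str
    addressee: Optional[str]  # None when the turn addresses no specific user
    text: str


def interaction_graph(turns):
    """Count directed speaker -> addressee edges over a conversation."""
    edges = Counter()
    for turn in turns:
        if turn.addressee is not None:
            edges[(turn.speaker, turn.addressee)] += 1
    return edges


def addressee_labels(turns):
    """Candidate addressee classes: every speaker in the conversation plus the dummy label."""
    users = sorted({turn.speaker for turn in turns})
    return users + [DUMMY]


# Illustrative three-user conversation with anonymized usernames.
conversation = [
    Turn("user_A", None, "How do I mount a USB drive?"),
    Turn("user_B", "user_A", "Try the mount command."),
    Turn("user_C", "user_A", "Which release are you on?"),
    Turn("user_A", "user_C", "The latest LTS."),
]

graph = interaction_graph(conversation)
labels = addressee_labels(conversation)
```

With three users, `labels` holds four classes, which matches the class count the Figure 5 caption gives for Ubuntu3. A six-user conversation would yield seven, as stated for Ubuntu6.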