Socio-Emotional Response Generation: A Human Evaluation Protocol for LLM-Based Conversational Systems

Lorraine Vanel, Ariel R. Ramos Vela, Alya Yacoubi, Chloé Clavel

TL;DR

The paper addresses the opacity of socio-emotional strategies in LLM-based conversational systems and the insufficiency of automated metrics for evaluating such content. It introduces a planning module that predicts a sequence of socio-emotional labels to guide response generation, and compares open-source baselines with and without this planning step. A novel human evaluation protocol combines coarse consistency checks with fine-grained annotations on socio-emotional criteria, demonstrating that planning the label sequence before generation improves performance over end-to-end generation, while also revealing limitations of current automatic metrics. The authors publicly release annotation tooling and data to facilitate future benchmarking and development of trustworthy, transparent conversational systems.

Abstract

Conversational systems are now capable of producing impressive and generally relevant responses. However, we have neither visibility into nor control over the socio-emotional strategies behind state-of-the-art Large Language Models (LLMs), which poses a problem in terms of their transparency and thus their trustworthiness for critical applications. Another issue is that current automated metrics are not able to properly evaluate the quality of generated responses beyond the dataset's ground truth. In this paper, we propose a neural architecture that includes an intermediate step for planning socio-emotional strategies before response generation. We compare the performance of open-source baseline LLMs to the outputs of these same models augmented with our planning module. We also contrast the results obtained from automated metrics with the evaluation results provided by human annotators. We describe a novel evaluation protocol that includes a coarse-grained consistency evaluation, as well as a finer-grained annotation of the responses on various social and emotional criteria. Our study shows that predicting a sequence of expected strategy labels and using this sequence to generate a response yields better results than a direct end-to-end generation scheme. It also highlights the divergences and the limits of current evaluation metrics for generated content. The code for the annotation platform and the annotated data are made publicly available for the evaluation of future models.

Paper Structure

This paper contains 15 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: The four dialogue acts used to annotate Daily Dialog, with examples from the dataset to assist this task.
  • Figure 2: Definition of each socio-emotional criterion rated in this evaluation, along with the rating scale used for each item.