Scoring with Large Language Models: A Study on Measuring Empathy of Responses in Dialogues

Henry J. Xie, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu

TL;DR

The paper studies how LLMs score empathy in dialogue responses and what drives those scores. It proposes a framework that approximates LLM scoring with explicit, interpretable features: embeddings, the MITI Code, and subfactors along three dimensions of empathy, which are fed to classifiers and pruned by feature selection. Classifiers trained on embeddings alone come close to generic LLM performance, while classifiers trained on the MITI Code and explicit subfactors, with feature selection, closely match fine-tuned LLMs: peak accuracy is 54.69% for fine-tuned GPT-4o-mini versus 53.65% for the feature-selected combination. The work offers a transparent, scalable approach to empathy scoring in social science contexts and informs future cross-domain studies of empathetic evaluation in dialogue systems.

Abstract

In recent years, Large Language Models (LLMs) have become increasingly powerful in their ability to complete complex tasks. One task for which LLMs are often employed is scoring, i.e., assigning a numerical value from a given scale to a subject. In this paper, we strive to understand how LLMs score, specifically in the context of empathy scoring. We develop a novel and comprehensive framework for investigating how effective LLMs are at measuring and scoring the empathy of responses in dialogues, and what methods can be employed to deepen our understanding of LLM scoring. Our strategy is to approximate the performance of state-of-the-art and fine-tuned LLMs with explicit and explainable features. We train classifiers using various features of dialogues, including embeddings, the Motivational Interviewing Treatment Integrity (MITI) Code, a set of explicit subfactors of empathy as proposed by LLMs, and a combination of the MITI Code and the explicit subfactors. Our results show that when using only embeddings, it is possible to achieve performance close to that of generic LLMs, and when utilizing the MITI Code and explicit subfactors scored by an LLM, the trained classifiers can closely match the performance of fine-tuned LLMs. We employ feature selection methods to identify the most crucial features in the process of empathy scoring. Our work provides a new perspective on understanding LLM empathy scoring and helps the LLM community explore the potential of LLM scoring in social science studies.
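
The idea of training a classifier on explicit features and then selecting the most useful ones can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which classifiers or feature selection methods are used, so a nearest-centroid classifier and greedy forward selection stand in for them, and the feature values, the three-level empathy scale, and all function names are hypothetical toy choices.

```python
from statistics import mean

def centroids(rows, labels, cols):
    """Mean feature vector per label, restricted to the chosen columns."""
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append([row[c] for c in cols])
    return {lab: [mean(col) for col in zip(*vecs)] for lab, vecs in by_label.items()}

def predict(row, cents, cols):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    point = [row[c] for c in cols]
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(point, cents[label]))
    return min(sorted(cents), key=dist)  # ties go to the smallest label

def accuracy(train, train_y, val, val_y, cols):
    """Fit centroids on the training rows, then score the validation rows."""
    cents = centroids(train, train_y, cols)
    hits = sum(predict(r, cents, cols) == y for r, y in zip(val, val_y))
    return hits / len(val)

def forward_select(train, train_y, val, val_y, n_features):
    """Greedy forward selection: keep adding the column that raises
    validation accuracy the most, and stop when no column helps."""
    chosen, best = [], 0.0
    while len(chosen) < n_features:
        candidates = [c for c in range(n_features) if c not in chosen]
        scored = [(accuracy(train, train_y, val, val_y, chosen + [c]), c)
                  for c in candidates]
        top_acc, top_col = max(scored, key=lambda t: (t[0], -t[1]))
        if top_acc <= best:
            break
        chosen.append(top_col)
        best = top_acc
    return chosen, best

# Hypothetical toy data: three made-up feature scores per response,
# labelled on an assumed 0-2 empathy scale. Column 0 tracks the label;
# columns 1 and 2 are deliberately uninformative on the validation rows.
train = [[0.1, 0.9, 0.2], [0.2, 0.1, 0.1],
         [0.5, 0.8, 0.5], [0.6, 0.2, 0.6],
         [0.9, 0.9, 0.8], [1.0, 0.1, 0.9]]
train_y = [0, 0, 1, 1, 2, 2]
val = [[0.15, 0.5, 0.9], [0.55, 0.5, 0.1], [0.95, 0.5, 0.5]]
val_y = [0, 1, 2]

chosen, best = forward_select(train, train_y, val, val_y, n_features=3)
print(chosen, best)
```

On this toy data the selection keeps only column 0, since adding either remaining column does not improve validation accuracy. The surviving columns are what make the scorer interpretable, which is the role feature selection plays in the framework described above.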

Paper Structure (16 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Our Methodology: Dataset, Models, Feature Sets, and Steps (Upper); Scoring Accuracy Achieved with Different Models and Feature Sets (Lower)
  • Figure 2: Scoring Accuracy of Classifiers on Embeddings
  • Figure 3: Scoring Accuracy of Classifiers on MITI Code
  • Figure 4: 3 Dimensions of Empathy and Their Subfactors
  • Figure 5: Scoring Accuracy of Different Prompt Combinations on GPT-4o-mini
  • ...and 3 more figures