SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation

Kun Zhao; Bohao Yang; Chen Tang; Chenghua Lin; Liang Zhan

TL;DR

SLIDE tackles the one-to-many problem in open-domain dialogue evaluation by combining a small, task-specific model (SLM) with large language models (LLMs). The SLM is trained with contrastive learning to pull a context and its positive responses closer together and to push adversarial negative responses away, using robust/non-robust embedding disentanglement and multiple loss terms. From a normalized distance $s_d$ and a probability $s_p$ it derives $Score_{SLM}=1-s_d+s_p$. Evaluation then combines $Score_{SLM}$ with a prompt-based $Score_{LLM}$ through a fusion rule to yield the final $Score$, balancing the strengths of both model types. Experiments on DailyDialog++, PersonaChat, and TopicalChat show that SLIDE achieves state-of-the-art correlations with human judgments and that the hybrid approach outperforms either model alone. The work also demonstrates data augmentation with LLM-generated responses.
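As a rough illustration of the formula $Score_{SLM}=1-s_d+s_p$, the sketch below computes the score from two embedding vectors and a probability. It is only a sketch under stated assumptions: the summary does not say how $s_d$ is normalized, so halving the cosine distance to map it into $[0, 1]$ is an assumption, and the function names `cosine_distance`, `normalized_distance`, and `score_slm` are hypothetical rather than taken from the authors' code.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); lies in [0, 2] for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def normalized_distance(u, v):
    """ASSUMPTION: map cosine distance from [0, 2] to [0, 1] by halving.

    The source only says s_d is a 'normalized distance'; the exact
    normalization is not given there.
    """
    return cosine_distance(u, v) / 2.0


def score_slm(s_d, s_p):
    """Score_SLM = 1 - s_d + s_p, as stated in the summary."""
    return 1.0 - s_d + s_p


# Hypothetical context and response embeddings, plus a made-up probability.
context_vec = [1.0, 0.0]
response_vec = [0.0, 1.0]
s_d = normalized_distance(context_vec, response_vec)
s_p = 0.75
print(score_slm(s_d, s_p))
```

If both $s_d$ and $s_p$ lie in $[0, 1]$, the resulting score lies in $[0, 2]$: a response that is close to the context and judged likely to be positive scores high, while a distant, unlikely response scores low.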

Abstract

The long-standing one-to-many problem of gold standard responses in open-domain dialogue systems presents challenges for automatic evaluation metrics. Though prior works have demonstrated some success by applying powerful Large Language Models (LLMs), existing approaches still struggle with the one-to-many problem and exhibit subpar performance in domain-specific scenarios. We assume the commonsense reasoning biases within LLMs may hinder their performance in domain-specific evaluations. To address both issues, we propose a novel framework, SLIDE (Small and Large Integrated for Dialogue Evaluation), that leverages both a small, specialised model (SLM) and LLMs for the evaluation of open-domain dialogues. Our approach introduces several techniques: (1) contrastive learning to differentiate between robust and non-robust response embeddings; (2) a novel metric for semantic sensitivity that combines embedding cosine distances with similarity learned through neural networks; and (3) a strategy for incorporating the evaluation results from both the SLM and LLMs. Our empirical results demonstrate that our approach achieves state-of-the-art performance in both the classification and evaluation tasks, and additionally the SLIDE evaluator exhibits better correlation with human judgements. Our code is available at https://github.com/hegehongcha/SLIDE-ACL2024.
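To make the contrastive objective more concrete, here is a minimal sketch of a triplet-style margin loss over cosine distances. This is a generic stand-in, not the paper's loss: the paper uses multiple loss terms and embedding disentanglement that are not spelled out in this summary. The function name `triplet_margin_loss`, the `margin` value, and the toy vectors are all assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(context, positive, negative, margin=0.5):
    """Generic triplet loss (ASSUMPTION: not the authors' exact objective).

    The loss is zero once the positive response is closer to the context
    than the adversarial negative by at least `margin`; otherwise it
    grows with the size of the violation.
    """
    d_pos = cosine_distance(context, positive)
    d_neg = cosine_distance(context, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings: the positive is aligned with the context,
# the negative points the opposite way.
context = [1.0, 0.0]
positive = [1.0, 0.0]
negative = [-1.0, 0.0]

print(triplet_margin_loss(context, positive, negative))  # well separated
print(triplet_margin_loss(context, negative, positive))  # roles swapped
```

Minimising a loss of this shape moves positive responses towards the context and adversarial negatives away from it, which is the behaviour the abstract attributes to the contrastively trained SLM.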

Paper Structure (24 sections, 9 equations, 3 figures, 4 tables)

Introduction
Related Work
Dialogue evaluation metrics
LLM-based Evaluators
Methodology
Model Architecture
Training Process
Evaluation Process
Experimental Setup
Dataset
Baselines
Evaluation Set
Experimental Results
Dialogue Classification Task
Dialogue Evaluation
...and 9 more sections

Figures (3)

Figure 1: The architecture of the proposed model. We first use an SLM trained by contrastive learning to calculate the distance between context and responses. We then calculate the probability of a response being positive and the cosine distance between context and response, and use both to obtain $score_{SLM}$. Secondly, we use an LLM to obtain $score_{LLM}$. Finally, we compute the final score in accordance with our finding that LLMs are more inclined to recognise negative responses correctly, whilst SLMs recognise positive responses better.
Figure 2: t-SNE visualisation of the sentence representation of context and responses. The left panel, labeled Normal, illustrates the vectors prior to disentanglement, whereas the right panel, labeled Disentangled, displays the post-disentanglement outcomes. This demonstrates the convergence of negative responses towards the context following disentanglement.
Figure 3: t-SNE visualisation of the sentence representation of context and responses for some examples. Each example includes a context, five positive responses, and five adversarial negative responses. The left panel, titled "Normal", shows the vectors before disentanglement, whilst the right panel, titled "Disentangled", shows them afterwards. These figures demonstrate that the positive responses move closer to the context after disentangling.
