SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation
Kun Zhao, Bohao Yang, Chen Tang, Chenghua Lin, Liang Zhan
TL;DR
SLIDE tackles the one-to-many problem in open-domain dialogue evaluation by jointly leveraging a small, task-specific model (SLM) and large language models (LLMs). The SLM is trained via contrastive learning to pull positive responses closer to their context and push adversarial negatives away, using robust/non-robust embedding disentanglement and multiple loss terms. From a normalized distance $s_d$ and a probability $s_p$, it derives $Score_{SLM}=1-s_d+s_p$. Evaluation then combines $Score_{SLM}$ with a prompt-based $Score_{LLM}$ through a fusion rule to yield the final $Score$, balancing the strengths of both model types. Experiments on DailyDialog++, PersonaChat, and TopicalChat show that SLIDE achieves state-of-the-art correlations with human judgments and that the hybrid approach outperforms either model alone. The work also demonstrates data augmentation with LLM-generated responses and offers a scalable, robust evaluator for real-world dialogue systems.
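As a rough illustration of the $Score_{SLM}=1-s_d+s_p$ formula, the sketch below computes the score from two toy embedding vectors and a given probability. It is not the authors' implementation: the helper names, the use of cosine distance, and the division by 2 to map that distance into $[0, 1]$ are assumptions made here for illustration, and $s_p$ is simply passed in as a number rather than produced by a trained network.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); lies in [0, 2] for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def normalized_distance(u, v):
    """ASSUMPTION: rescale cosine distance from [0, 2] to [0, 1].

    The summary only says s_d is a normalized distance; this particular
    normalization is a hypothetical choice for the sketch.
    """
    return cosine_distance(u, v) / 2.0


def score_slm(s_d, s_p):
    """Score_SLM = 1 - s_d + s_p, as stated in the summary above."""
    return 1.0 - s_d + s_p


# Toy embeddings (hypothetical values, not model outputs).
context_emb = [1.0, 0.0]
response_emb = [1.0, 0.0]

s_d = normalized_distance(context_emb, response_emb)  # identical direction
s_p = 0.5  # stand-in for the SLM's predicted probability
slm_score = score_slm(s_d, s_p)
```

Under the assumption that both $s_d$ and $s_p$ lie in $[0, 1]$, the resulting score lies in $[0, 2]$: a small distance and a high probability both raise it.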
Abstract
The long-standing one-to-many problem of gold standard responses in open-domain dialogue systems presents challenges for automatic evaluation metrics. Though prior works have demonstrated some success by applying powerful Large Language Models (LLMs), existing approaches still struggle with the one-to-many problem and exhibit subpar performance in domain-specific scenarios. We assume the commonsense reasoning biases within LLMs may hinder their performance in domain-specific evaluations. To address both issues, we propose a novel framework, SLIDE (Small and Large Integrated for Dialogue Evaluation), that leverages both a small, specialised model (SLM) and LLMs for the evaluation of open-domain dialogues. Our approach introduces several techniques: (1) contrastive learning to differentiate between robust and non-robust response embeddings; (2) a novel metric for semantic sensitivity that combines embedding cosine distances with similarity learned through neural networks; and (3) a strategy for incorporating the evaluation results from both the SLM and LLMs. Our empirical results demonstrate that our approach achieves state-of-the-art performance in both the classification and evaluation tasks, and additionally the SLIDE evaluator exhibits better correlation with human judgements. Our code is available at https://github.com/hegehongcha/SLIDE-ACL2024.
