Multi-turn Response Selection with Commonsense-enhanced Language Models

Yuandong Wang; Xuhui Ren; Tong Chen; Yuxiao Dong; Nguyen Quoc Viet Hung; Jie Tang

TL;DR

This work tackles multi-turn response selection by injecting external commonsense through a Siamese network (SinLG) that couples a pre-trained language model (PLM) with a graph neural network (GNN). A knowledge-extraction pipeline builds a subgraph from ConceptNet for each dialogue sample, and a KG-guided training objective transfers commonsense from the GNN to the PLM via a cosine similarity loss, while a BCE loss optimizes the final predictions. Experiments on two variants of PERSONA-CHAT show SinLG achieving state-of-the-art results, with particularly strong gains under harder comprehension (revised personas) and in low-resource settings. Inference stays efficient because only the PLM runs online and no knowledge-graph processing is needed at that stage. The method demonstrates the value of structured external knowledge for improving dialogue understanding and response selection, offering practical improvements for real-time conversational systems.
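The training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted-sum combination, the weight `lam`, the linear scoring layer (`w`, `b`), and the choice to apply the BCE loss to the PLM score only are all assumptions made for illustration. The embeddings are plain Python lists standing in for the PLM and GNN outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def similarity_loss(h_plm, h_gnn):
    """Zero when the two representations point in the same direction."""
    return 1.0 - cosine_similarity(h_plm, h_gnn)

def bce_loss(logit, label):
    """Numerically stable binary cross-entropy on a raw score (logit)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def score(h, w, b):
    """Hypothetical linear prediction layer over an embedding."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def training_loss(h_plm, h_gnn, label, w, b, lam=0.5):
    """Assumed combination: BCE on the PLM score plus a weighted similarity term."""
    return bce_loss(score(h_plm, w, b), label) + lam * similarity_loss(h_plm, h_gnn)

def select_response(plm_embeddings, w, b):
    """Inference uses PLM embeddings only; return the index of the best candidate."""
    scores = [score(h, w, b) for h in plm_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy usage with made-up 2-d embeddings.
w, b = [1.0, -1.0], 0.0
loss = training_loss([0.9, 0.1], [1.0, 0.0], label=1.0, w=w, b=b)
best = select_response([[0.1, 0.8], [0.9, 0.1], [0.4, 0.4]], w, b)
```

Because `select_response` never touches a GNN embedding, the sketch reflects the point made in the TL;DR: the graph side is needed only during training.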

Abstract

As a branch of advanced artificial intelligence, dialogue systems are prospering. Multi-turn response selection is a general research problem in dialogue systems. With the assistance of background information and pre-trained language models, state-of-the-art methods for this problem have improved impressively. However, existing studies neglect the importance of external commonsense knowledge. Hence, we design a Siamese network in which a pre-trained Language model merges with a Graph neural network (SinLG). SinLG uses Pre-trained Language Models (PLMs) to capture the word correlations in the context and response candidates, and a Graph Neural Network (GNN) to reason over helpful commonsense from an external knowledge graph. The GNN assists the PLM during fine-tuning, arousing its related memories to attain better performance. Specifically, we first extract related concepts as nodes from an external knowledge graph to construct, for each sample, a subgraph with the context-response pair as a super node. Next, we learn two representations of the context-response pair via the PLM and the GNN. A similarity loss between the two representations transfers the commonsense knowledge from the GNN to the PLM. Only the PLM is then used for online inference, which keeps inference efficient. Finally, we conduct extensive experiments on two variants of the PERSONA-CHAT dataset, which show that our solution not only improves the performance of the PLM but also achieves efficient inference.
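The subgraph construction mentioned in the abstract (related concepts as nodes, plus a super node for the context-response pair) might look roughly like the sketch below. Everything concrete here is an assumption for illustration: the whitespace tokenizer, the one-hop neighborhood, the `<pair>` node name, the `mentions` relation, and the three toy triples, which are not taken from ConceptNet itself.

```python
def extract_subgraph(text, kg_edges, super_node="<pair>"):
    """Build a per-sample subgraph from (head, relation, tail) triples.

    Concepts are matched by exact lowercase token overlap, which is a
    simplification: punctuation and multi-word concepts are not handled.
    """
    tokens = set(text.lower().split())
    concepts = {c for head, _, tail in kg_edges for c in (head, tail)}
    matched = tokens & concepts

    # Keep every KG edge that touches a matched concept (one-hop neighborhood).
    edges = [e for e in kg_edges if e[0] in matched or e[2] in matched]
    nodes = set(matched)
    for head, _, tail in edges:
        nodes.update((head, tail))

    # Link the super node (the context-response pair) to each matched concept.
    edges += [(super_node, "mentions", c) for c in sorted(matched)]
    nodes.add(super_node)
    return nodes, edges

# Toy knowledge graph; the triples are invented for this example.
toy_kg = [
    ("dog", "IsA", "pet"),
    ("pet", "RelatedTo", "home"),
    ("guitar", "IsA", "instrument"),
]
nodes, edges = extract_subgraph("i walk my dog every morning", toy_kg)
```

In this toy run only `dog` is matched, so the subgraph holds `dog`, its neighbor `pet`, and the super node. A GNN would then propagate information along these edges to produce the graph-side representation of the pair.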

Paper Structure (32 sections, 12 equations, 7 figures, 9 tables)

Introduction
Problem Statement
Definitions
Multi-turn Response Selection (MRS)
SOLUTION
Knowledge Extraction
SinLG
Two Transformations
Natural Language Understanding via PLM
Structured Knowledge Learning via GNN
KG-Guided Training
Similarity Loss.
Inference Loss.
Optimization Strategy.
Efficient Inference
...and 17 more sections

Figures (7)

Figure 1: An example for multi-turn conversations with persona information. For convenience, we provide each utterance in the dialogue with an identifier, such as P1-1 for the first item of B's original persona, A1 for the first utterance of A. The revised persona is rephrased from the original one in an implicit way, which is more challenging for the dialogue agent to comprehend and respond.
Figure 2: An overview of our solution. This figure corresponds to the content of Section 3.1 through Section 3.4, i.e., Eq. (\ref{eq:score}) to Eq. (\ref{eq:final-loss}). A dialogue text includes $P$, $C$, and $R$, which denote the persona, context, and response candidate set, respectively, as defined in Section \ref{sec:definitions}. $\mathcal{G}^s$ is the subgraph extracted from the knowledge graph $\mathcal{G}$. Trans a and Trans b represent two different transformations of the input data and are detailed in Section \ref{sec:2trans}. Prediction $f_d$ is a neural network layer, e.g., a feed-forward layer, that computes the final results from the embedding vectors.
Figure 3: SinLG performance under the low-resource scenario on PERSONA-CHAT original.
Figure 4: SinLG performance under the low-resource scenario on PERSONA-CHAT revised.
Figure 5: Efficiency analysis of models. Note that the inference time corresponds to $1\sim7 \times 10^3$ instances. For each instance with $20$ response candidates, the average, worst, and best time costs of SinLG-S3 are $1.8315$ s, $2.534$ s, and $1.5523$ s, while those of SinLG are $0.2171$ s, $0.221$ s, and $0.216$ s, respectively.
...and 2 more figures
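As a quick worked check of the Figure 5 caption, the per-instance times can be turned into speedup ratios. The numbers below are copied from the caption. Comparing like with like (average to average, worst to worst, best to best) is a choice made here for illustration, and the variable and function names are hypothetical. Under that pairing SinLG comes out roughly 7 to 11 times faster per instance than SinLG-S3.

```python
# Per-instance inference times in seconds, copied from the Figure 5 caption.
sinlg_s3 = {"average": 1.8315, "worst": 2.534, "best": 1.5523}
sinlg = {"average": 0.2171, "worst": 0.221, "best": 0.216}

def speedup(slow, fast):
    """Ratio of the slower time to the faster time for each statistic."""
    return {name: slow[name] / fast[name] for name in slow}

ratios = speedup(sinlg_s3, sinlg)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```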

Theorems & Definitions (6)

definition 1
definition 2
definition 3
definition 4
definition 5
definition 6
