Hi Model, generating 'nice' instead of 'good' is not as bad as generating 'rice'! Towards Context and Semantic Infused Dialogue Generation Loss Function and Evaluation Metric

Abhisek Tiwari; Muhammed Sinan; Kaushik Roy; Amit Sheth; Sriparna Saha; Pushpak Bhattacharyya

TL;DR

The paper addresses the gap between human judgment and the lexical objectives and metrics commonly used in dialogue generation, namely cross-entropy loss and word-overlap evaluation metrics. It proposes the SemTextualLogue loss, which combines the cross-entropy term $L_{CE}$, a semantic/contextual term $L_{SCL}$, and a baseline-based term $L_{BSE}$, balanced by hyperparameters $\lambda$ and $\sigma$. It also proposes Dialuation, a context- and semantics-aware evaluation metric that blends contextual relevance $CR$ and semantic similarity $SS$ with weights $\delta_c$ and $\delta_{ss}$. During training, a Contanic score quantifies the contextual and semantic relevance of the generated response. The approach is evaluated on task-oriented and open-domain dialogue (MultiWOZ 2.2 and PersonaChat) using encoder-decoder and pre-trained models (GPT-2, LLaMA). Models trained with the proposed loss outperform those trained with cross-entropy, and Dialuation correlates more strongly with human evaluation than traditional metrics. These findings suggest that incorporating semantics and context into both loss functions and evaluation metrics yields more coherent, contextually appropriate, and human-aligned dialogue systems. The work also provides a concrete mathematical framework, including the $L_{final}$ and Dialuation formulations, that can be extended with external knowledge and adapted to diverse dialogue settings.
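The blend of contextual relevance and semantic similarity described above can be illustrated with a small sketch. This summary does not give the exact Dialuation formula, so the code assumes a plain weighted sum of $CR$ and $SS$, with both terms computed as cosine similarities over toy embedding vectors. The function names, the vectors, and the default weights of 0.5 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dialuation_sketch(context_vec, gold_vec, response_vec, delta_c=0.5, delta_ss=0.5):
    """Assumed form: weighted sum of contextual relevance (CR) and semantic similarity (SS)."""
    cr = cosine(context_vec, response_vec)  # CR: response vs. dialogue context
    ss = cosine(gold_vec, response_vec)     # SS: response vs. gold utterance
    return delta_c * cr + delta_ss * ss


# Toy 2-d "embeddings"; a real system would use a sentence encoder instead.
context = [1.0, 0.0]
gold = [0.6, 0.8]
relevant_response = [0.8, 0.6]
unrelated_response = [0.0, -1.0]

score_relevant = dialuation_sketch(context, gold, relevant_response)
score_unrelated = dialuation_sketch(context, gold, unrelated_response)
print(score_relevant, score_unrelated)
```

Under these assumptions, a response that is close to both the context and the gold utterance scores higher than an unrelated one, even though neither matches the gold utterance word for word.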

Abstract

Over the past two decades, dialogue modeling has made significant strides, moving from simple rule-based responses to personalized and persuasive response generation. Despite these advancements, the objective functions and evaluation metrics for dialogue generation have remained stagnant. Lexical objectives and metrics such as cross-entropy and BLEU have two key limitations: (a) word-to-word matching without semantic consideration: they assign the same penalty to generating "nice" and to generating "rice" in place of "good"; and (b) no account of dialogue context when evaluating the generated response: even if a generated response is relevant to the ongoing dialogue context, it may still be penalized for not matching the gold utterance provided in the corpus. In this paper, we first investigate these limitations comprehensively and propose a new loss function called the Semantic Infused Contextualized diaLogue (SemTextualLogue) loss. We also formulate an evaluation metric called Dialuation, which incorporates both context and semantic relevance. We experimented with both non-pretrained and pre-trained models on two dialogue corpora, encompassing task-oriented and open-domain scenarios. We found that dialogue generation models trained with the SemTextualLogue loss attained superior performance compared to those trained with the traditional cross-entropy loss. The findings establish that effective training of a dialogue generation model hinges significantly on incorporating semantics and context. This pattern is mirrored in the Dialuation metric, where considering both context and semantics correlates more strongly with human evaluation than traditional metrics do.
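Limitation (a) can be checked numerically. Token-level cross-entropy depends only on the probability the model assigns to the gold token, so it cannot tell a near-synonym from an unrelated word. The sketch below uses a hypothetical three-word vocabulary and made-up probabilities; it illustrates standard cross-entropy only, not the proposed loss.

```python
import math


def cross_entropy(probs, target):
    """Token-level cross-entropy: negative log-probability of the gold token."""
    return -math.log(probs[target])


# Hypothetical next-token distributions; the gold token is "good".
pred_nice = {"good": 0.10, "nice": 0.85, "rice": 0.05}  # model prefers a near-synonym
pred_rice = {"good": 0.10, "nice": 0.05, "rice": 0.85}  # model prefers an unrelated word

loss_nice = cross_entropy(pred_nice, "good")
loss_rice = cross_entropy(pred_rice, "good")

# Both losses equal -log(0.10): the loss ignores where the remaining mass went.
print(loss_nice, loss_rice)
```

Because both distributions put the same mass on "good", the two losses are identical, which is the behavior a semantics-aware training signal is meant to correct.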

Paper Structure (15 sections, 12 equations, 3 figures, 10 tables)

Introduction
Related Works
Proposed Methodology
Response Generation Model
Contanic Score
Weighted Cross Entropy
SemTextualLogue Loss
Dialuation
Experimental Details
Result and Discussion
Experimental Results
Findings and Observations
Case Study and Analysis
Conclusion
Ethical Consideration

Figures (3)

Figure 1: Illustration of the key limitation of cross-entropy for dialogue generation. Some adequate responses ($y_1$ and $y_2$) are penalized as much as, or more than, a useless response ($y_2$).
Figure 2: Proposed architecture of semantic and context-reinforced dialogue generation. The dialogue generation model first generates a response based on the dialogue context and the current utterance. It then calculates a context and semantic relevance score (Contanic) and reinforces the feedback with the traditional cross-entropy loss.
Figure 3: An example demonstrating the significance of context and sentence semantics for evaluating dialogue responses.
