Hi Model, generating 'nice' instead of 'good' is not as bad as generating 'rice'! Towards Context and Semantic Infused Dialogue Generation Loss Function and Evaluation Metric
Abhisek Tiwari, Muhammed Sinan, Kaushik Roy, Amit Sheth, Sriparna Saha, Pushpak Bhattacharyya
TL;DR
The paper addresses the mismatch between human judgment and the lexical tools used in dialogue generation, namely the cross-entropy loss and word-overlap evaluation metrics, by introducing semantics- and context-aware components. It proposes the SemTextualLogue loss, which combines $L_{CE}$, a semantic/contextual term $L_{SCL}$, and a baseline-based term $L_{BSE}$, guided by hyperparameters $\lambda$ and $\sigma$. It also proposes a context- and semantics-aware evaluation metric called Dialuation, which blends contextual relevance $CR$ and semantic similarity $SS$ with weights $\delta_c$ and $\delta_{ss}$. The approach integrates a Contanic score to quantify context and meaning during training, and is evaluated on both task-oriented and open-domain dialogues (MultiWoz 2.2 and PersonaChat) using encoder–decoder and pre-trained models (GPT-2, LLaMA), demonstrating improved agreement with human judgments and stronger embedding-based correlations than traditional metrics. These findings suggest that incorporating semantics and context into both loss functions and evaluation metrics yields more coherent, contextually appropriate, and human-aligned dialogue systems, with implications for practical deployment and future research. Notably, the work provides a concrete mathematical framework, including the $L_{final}$ and Dialuation formulations, that can be extended with external knowledge and adapted to diverse dialogue settings.
Abstract
Over the past two decades, dialogue modeling has made significant strides, moving from simple rule-based responses to personalized and persuasive response generation. However, despite these advancements, the objective functions and evaluation metrics for dialogue generation have remained stagnant. Lexical objectives and metrics such as cross-entropy and BLEU have two key limitations: (a) word-to-word matching without semantic consideration: they penalize generating "nice" and generating "rice" equally when the gold word is "good"; and (b) no use of context when evaluating the generated response: even if a generated response is relevant to the ongoing dialogue context, it may still be penalized for not matching the gold utterance provided in the corpus. In this paper, we first investigate these limitations comprehensively and propose a new loss function called the Semantic Infused Contextualized diaLogue (SemTextualLogue) loss. We also formulate an evaluation metric called Dialuation, which incorporates both context and semantic relevance. We experimented with both non-pretrained and pre-trained models on two dialogue corpora, encompassing task-oriented and open-domain scenarios. We found that dialogue generation models trained with the SemTextualLogue loss attained superior performance compared to those trained with the traditional cross-entropy loss. The findings establish that the effective training of a dialogue generation model hinges significantly on incorporating semantics and context. This pattern is mirrored in the introduced Dialuation metric, where considering both context and semantics correlates more strongly with human evaluation than traditional metrics do.
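Limitation (a) can be checked with a few lines of arithmetic. The sketch below is an illustration only, not the paper's code: the three-word vocabulary, the probability values, and the helper name `cross_entropy` are all assumptions made for the example. It shows that token-level cross-entropy depends only on the probability assigned to the gold word, so a model leaning towards "nice" and a model leaning towards "rice" receive the same loss when the gold word is "good".

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the gold token under a predicted distribution."""
    return -math.log(probs[target])


# Hypothetical next-token distributions over a toy vocabulary (assumed values).
# Both give the gold token "good" the same probability, but they place the
# remaining mass on a near-synonym ("nice") or an unrelated word ("rice").
leans_nice = {"good": 0.2, "nice": 0.7, "rice": 0.1}
leans_rice = {"good": 0.2, "nice": 0.1, "rice": 0.7}

loss_nice = cross_entropy(leans_nice, "good")
loss_rice = cross_entropy(leans_rice, "good")

# The two losses are identical: cross-entropy never looks at where the
# non-gold probability mass went.
print(loss_nice == loss_rice)
```

A semantics-aware term is what allows a training signal to distinguish these two cases, for instance by comparing embeddings of the generated and gold text rather than their token identities.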
