Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability
Iván Martínez-Murillo, Paloma Moreda, Elena Lloret
TL;DR
The paper examines how external knowledge influences interpretability in natural language generation, focusing on a commonsense generation task. It introduces KITGI, a benchmark extending CommonGen that pairs concept sets with retrieved ConceptNet relations and includes manually annotated outputs, and uses it to study generation with the T5-Large model. A three-stage interpretability framework removes key knowledge, regenerates the outputs, and manually evaluates them for commonsense plausibility and concept coverage. Sentences generated with full knowledge were correct on both criteria in $91\%$ of cases, against only $6\%$ once highly relevant relations were filtered out, underscoring the critical role of relevant knowledge for coherent, complete NLG and motivating interpretable evaluation frameworks that go beyond surface-level metrics.
Abstract
This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with retrieved semantic relations from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge and with filtered knowledge where highly relevant relations were deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91\% correctness across both criteria, while filtering reduced performance drastically to 6\%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.
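The three-stage method described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Relation` class, the numeric `relevance` score (the abstract does not say how relevance is determined), the prompt format, the helper names, and the toy relations and annotations. The generation step itself (T5-Large in the paper) is not invoked here; the sketch only shows how the two input conditions could be built and how manual judgements could be aggregated into a joint correctness rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One ConceptNet-style triple. The relevance score is hypothetical."""
    head: str
    label: str
    tail: str
    relevance: float


def filter_relations(relations, top_k):
    """Stage 1 (assumed form): drop the top_k most relevant relations."""
    ranked = sorted(relations, key=lambda r: r.relevance, reverse=True)
    removed = set(ranked[:top_k])
    return [r for r in relations if r not in removed]


def build_input(concepts, relations):
    """Serialize concepts and knowledge into one prompt (format is assumed)."""
    knowledge = "; ".join(f"{r.head} {r.label} {r.tail}" for r in relations)
    return f"concepts: {', '.join(concepts)} | knowledge: {knowledge}"


def joint_correctness(annotations):
    """Stage 3: share of outputs judged both plausible and concept-complete.

    Each annotation is a (plausible, covers_all_concepts) pair of booleans
    produced by a human annotator.
    """
    if not annotations:
        return 0.0
    correct = sum(1 for plausible, covered in annotations if plausible and covered)
    return correct / len(annotations)


# Toy data, invented for illustration only.
concepts = ["dog", "frisbee", "catch"]
relations = [
    Relation("dog", "CapableOf", "catch frisbee", 0.9),
    Relation("frisbee", "IsA", "toy", 0.4),
    Relation("dog", "IsA", "animal", 0.2),
]

filtered = filter_relations(relations, top_k=1)
full_prompt = build_input(concepts, relations)         # full-knowledge condition
filtered_prompt = build_input(concepts, filtered)      # filtered-knowledge condition

# Stage 2 would pass both prompts to the generator and collect the sentences;
# the annotations below are placeholders for the manual judgements.
toy_annotations = [(True, True), (True, False), (False, True), (True, True)]
rate = joint_correctness(toy_annotations)
```

Counting an output as correct only when it passes both criteria mirrors the way the abstract reports a single figure "across both criteria"; scoring the two criteria separately would be an equally reasonable design if per-criterion effects are of interest.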
