Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models
Shuaijie She, Shujian Huang, Xingyun Wang, Yanke Zhou, Jiajun Chen
TL;DR
This work presents the Dialogue Comprehension Benchmark (DIAC) to assess factual consistency and dialogue understanding in large language models. It introduces two tasks: DIAC-Sum, which evaluates the faithfulness of dialogue summaries, and DIAC-FactQA, which poses factual questions derived from those summaries, with SAMSum as the evaluation corpus. The study reveals substantial deficiencies in current LLMs: on average, about 26.8% of summaries contain factual inconsistencies and about 36.1% of factual questions are answered incorrectly, with subject/object understanding as a key bottleneck. To address this, the authors propose a fine-tuning paradigm (via LoRA) on auto-constructed multi-task data that improves dialogue understanding, achieving relative error reductions of about 11% on DIAC-FactQA and 27.6% on DREAM, which suggests a viable path toward more reliable dialogue comprehension in LLMs.
Abstract
Large language models (LLMs) usually interact with users through dialogue and generate responses that follow their instructions, which naturally requires dialogue comprehension abilities. However, dialogue comprehension is a general language ability that is hard to evaluate directly. In this work, we propose to evaluate it by focusing on factual consistency, with the help of the dialogue summarization task. Besides evaluating and analyzing the dialogue summarization performance (DIAC-Sum) of different LLMs, we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-QA). Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistencies. Even ChatGPT, the strongest model evaluated, makes such errors in 16% of its summaries. For the more challenging task of answering the factual questions, the average error rate across all evaluated LLMs is 36.1%. Both results indicate serious deficiencies. Detailed analysis shows that understanding the subject and object of a conversation remains challenging for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data, which achieves a relative error rate reduction of 11% on DIAC-QA.
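Since the improvement is reported as a relative rather than absolute error rate reduction, a small sketch may help make the distinction concrete. The function names and the 40% baseline below are hypothetical, chosen only for illustration; they are not figures from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction of the baseline error that was removed."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


def error_after_reduction(baseline_error: float, relative_reduction: float) -> float:
    """Return the error rate left after applying a relative reduction."""
    return baseline_error * (1.0 - relative_reduction)


# Hypothetical baseline (NOT a number from the paper): a model that
# answers 40% of the factual questions incorrectly before fine-tuning.
baseline = 0.40
after = error_after_reduction(baseline, 0.11)

print(f"error rate after an 11% relative reduction: {after:.3f}")
print(f"absolute drop in percentage points: {(baseline - after) * 100:.1f}")
```

Under this assumed baseline, an 11% relative reduction corresponds to an absolute drop of only about 4.4 percentage points, so the relative figure should not be read as an 11-point drop in the error rate.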
