$\textit{Dial BeInfo for Faithfulness}$: Improving Factuality of Information-Seeking Dialogue via Behavioural Fine-Tuning
Evgeniia Razumovskaia, Ivan Vulić, Pavle Marković, Tomasz Cichy, Qian Zheng, Tsung-Hsien Wen, Paweł Budzianowski
TL;DR
BeInfo tackles factual hallucinations in information-seeking dialogue by applying behavioural fine-tuning to instruction-tuned LLMs, guiding them to ground responses in the provided knowledge $\mathcal{K}$. It augments training with knowledge distractors $\mathcal{K}'$ and unanswerable dialogues to improve selectivity and response adequacy, improving faithfulness across FaithDial, TopiOCQA, and DoQA, including strong zero-shot transfer. The method yields further gains when combined with task-specific fine-tuning and performs competitively on real conversations, with smaller models approaching or surpassing GPT-4 in some settings. These results highlight BeInfo as a practical, domain-robust strategy for enhancing factuality in production-ready information-seeking dialogue systems, with potential for integration with other faithfulness techniques and broader applicability beyond QA.
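As an illustration of the augmentation described above, the following minimal sketch builds one training example from the true knowledge $\mathcal{K}$, sampled distractors $\mathcal{K}'$, and an optional unanswerable variant. It is not the authors' implementation: the prompt layout, the number of distractors, the fallback wording, and every name in the code (`build_example`, `Example`, `FALLBACK`) are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    target: str


# Hypothetical fallback reply for unanswerable dialogues (wording is an assumption).
FALLBACK = "Sorry, I could not find that information."


def build_example(history, knowledge, distractor_pool, response,
                  n_distractors=2, answerable=True, rng=None):
    """Build one tuning example with distractors K' mixed into the knowledge."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    # Distractors are snippets that are not part of the true knowledge K.
    candidates = [d for d in distractor_pool if d not in knowledge]
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))
    # Answerable: K plus K'. Unanswerable: K' only, so no snippet supports an answer.
    snippets = (list(knowledge) if answerable else []) + distractors
    rng.shuffle(snippets)
    prompt = (
        "Knowledge:\n" + "\n".join(f"- {s}" for s in snippets)
        + "\nDialogue:\n" + "\n".join(history)
        + "\nResponse:"
    )
    return Example(prompt=prompt, target=response if answerable else FALLBACK)


# Usage with made-up data: one answerable and one unanswerable variant.
history = ["User: When does the cafe open?"]
knowledge = ["The cafe opens at 8am."]
pool = ["The gym closes at 10pm.", "Parking is free on Sundays."]
answerable = build_example(history, knowledge, pool, "It opens at 8am.")
unanswerable = build_example(history, knowledge, pool, "It opens at 8am.",
                             answerable=False)
```

In this sketch the unanswerable variant keeps the same user query but withholds the supporting snippet, so the target becomes the fallback reply rather than an answer that the knowledge does not support.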
Abstract
Factuality is a crucial requirement in information-seeking dialogue: the system should respond to the user's queries so that the responses are meaningful and aligned with the knowledge provided to the system. However, most modern large language models suffer from hallucinations, that is, they generate responses not supported by or contradicting the knowledge source. To mitigate the issue and increase faithfulness of information-seeking dialogue systems, we introduce BeInfo, a simple yet effective method that applies behavioural tuning to aid information-seeking dialogue. Relying on three standard datasets, we show that models tuned with BeInfo become considerably more faithful to the knowledge source, both on datasets and domains seen during BeInfo-tuning and on unseen domains when applied in a zero-shot manner. In addition, we show that models with 3B parameters (e.g., Flan-T5) tuned with BeInfo demonstrate strong performance on data from real 'production' conversations and outperform GPT-4 when tuned on a limited amount of such realistic in-domain dialogues.
