$\textit{Dial BeInfo for Faithfulness}$: Improving Factuality of Information-Seeking Dialogue via Behavioural Fine-Tuning

Evgeniia Razumovskaia; Ivan Vulić; Pavle Marković; Tomasz Cichy; Qian Zheng; Tsung-Hsien Wen; Paweł Budzianowski

TL;DR

BeInfo tackles factual hallucinations in information-seeking dialogue by applying behavioural fine-tuning to instruction-tuned LLMs, guiding them to ground responses in the provided knowledge $\mathcal{K}$. It augments training with knowledge distractors $\mathcal{K}'$ and unanswerable dialogues to improve selectivity and response adequacy, achieving improved faithfulness across FaithDial, TopiOCQA, and DoQA, including strong zero-shot transfer. The method yields further gains when combined with task-specific fine-tuning and demonstrates competitive performance in real conversations, even approaching or surpassing GPT-4 in some settings for smaller models. These results highlight BeInfo as a practical, domain-robust strategy to enhance factuality in production-ready information-seeking dialogue systems, with potential for integration with other faithfulness techniques and broader applicability beyond QA.
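The idea of mixing knowledge distractors and unanswerable dialogues into the training data can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `build_example`, the prompt template, the refusal string, and the choice to model an unanswerable turn by withholding the relevant knowledge are all assumptions made for illustration.

```python
import random

# Hypothetical refusal text (assumption); the target the paper actually uses
# for unanswerable turns is not given in this section.
UNANSWERABLE_RESPONSE = "Sorry, I cannot answer that from the information provided."


def build_example(history, response, knowledge, distractor_pool,
                  n_distractors=2, answerable=True, rng=None):
    """Build one (input, target) pair for behavioural fine-tuning.

    history:          list of (speaker, utterance) tuples
    response:         gold response grounded in `knowledge`
    knowledge:        list of relevant knowledge snippets (K)
    distractor_pool:  list of unrelated snippets to sample distractors (K') from
    answerable:       if False, the relevant knowledge is withheld (assumption)
    """
    if rng is None:
        rng = random.Random(0)

    # Sample distractors that do not overlap with the relevant knowledge.
    candidates = [d for d in distractor_pool if d not in knowledge]
    distractors = rng.sample(candidates, k=min(n_distractors, len(candidates)))

    # Mix relevant knowledge (if any) with distractors and shuffle the order,
    # so the model cannot rely on position to find the relevant snippet.
    sources = (list(knowledge) if answerable else []) + distractors
    rng.shuffle(sources)

    knowledge_block = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    dialogue_block = "\n".join(f"{speaker}: {utt}" for speaker, utt in history)
    prompt = (
        f"Knowledge:\n{knowledge_block}\n\n"
        f"Dialogue:\n{dialogue_block}\n\n"
        "Respond using only the knowledge above."
    )

    target = response if answerable else UNANSWERABLE_RESPONSE
    return {"input": prompt, "target": target, "n_sources": len(sources)}
```

Under these assumptions, an answerable example contains the relevant snippet hidden among distractors, which exercises selectivity, while an unanswerable example contains only distractors and is paired with a refusal, which exercises response adequacy.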

Abstract

Factuality is a crucial requirement in information-seeking dialogue: the system should respond to the user's queries so that the responses are meaningful and aligned with the knowledge provided to the system. However, most modern large language models suffer from hallucinations, that is, they generate responses that are not supported by, or that contradict, the knowledge source. To mitigate the issue and increase the faithfulness of information-seeking dialogue systems, we introduce BeInfo, a simple yet effective method that applies behavioural tuning to aid information-seeking dialogue. Relying on three standard datasets, we show that models tuned with BeInfo become considerably more faithful to the knowledge source, both on datasets and domains seen during BeInfo-tuning and on unseen domains when applied in a zero-shot manner. In addition, we show that models with 3B parameters (e.g., Flan-T5) tuned with BeInfo demonstrate strong performance on data from real `production' conversations and outperform GPT4 when tuned on a limited amount of such realistic in-domain dialogues.

Paper Structure (14 sections, 7 figures, 13 tables)

Introduction
Methodology
Experimental Setup
Results and Discussion
Faithfulness on Unseen Data
Evaluating BeInfo on Real Conversations
Related Work
Conclusion and Future Work
Example of Input
Additional Dataset Statistics and Characteristics
Per-Domain Performance on DoQA
Zero-Shot Results on TopiOCQA
Evaluating Faithfulness with GPT4-Eval
Results for Faithfulness vs. Abstractiveness

Figures (7)

Figure 1: An example of an information-seeking dialogue based on the DoQA dataset \cite{campos-etal-2020-doqa}. Potential responses R1, R2, R3 at the bottom illustrate different issues with two crucial aspects of factual faithfulness: selectivity and response adequacy.
Figure 2: An overview of different fine-tuning and inference setups for LLMs with and without BeInfo (§\ref{sec:exp}).
Figure 3: Results of task-specific tuning on FaithDial (left) and TopiOCQA (right). 'Task-only' denotes Flan-T5 tuned directly on FaithDial or TopiOCQA, again with knowledge distractors. 'Full' denotes the model first tuned with BeInfo on both datasets and then further tuned on each of the datasets; see Figure \ref{fig:setups}.
Figure 4: Distribution of GPT4-Eval scores of 4 variants based on Flan-T5$_{\textsc{XL}}$ and GPT-4. See Table \ref{tab:likert} in Appendix \ref{app:likert} for the interpretation of the individual scores.
Figure 5: Density and $\mathcal{K}$-BERTScore on hotel-200 illustrating the trade-off between faithfulness (y-axis) and abstractiveness (x-axis) for Flan-T5$_{\textsc{XL}}$ for different setups: (i) XL-original: 'off-the-shelf' Flan-T5$_{\textsc{XL}}$; (ii) BeInfo general-only: Flan-T5$_{\textsc{XL}}$ tuned with BeInfo on FaithDial and TopiOCQA without any in-task data; (iii) BeInfo task-only: Flan-T5$_{\textsc{XL}}$ fine-tuned only on task-specific data; (iv) BeInfo full. Numeric results are provided in Appendix \ref{sec:results_faithful_abstractiveness}.
...and 2 more figures