Detecting Response Generation Not Requiring Factual Judgment

Ryohei Kamei; Daiki Shiono; Reina Akama; Jun Suzuki

TL;DR

This study aims to make dialogue responses both attractive and factual by setting a task that predicts which sentences do not require a factual correctness judgment, such as expressions of agreement or personal opinions and feelings.

Abstract

With the remarkable development of large language models (LLMs), ensuring the factuality of their output has become a challenge. However, grounding all the content of a response in given knowledge or facts is not necessarily desirable in dialogue. To achieve both attractiveness and factuality in dialogue responses, this study sets a task of predicting sentences that do not require a factual correctness judgment, such as expressions of agreement or personal opinions and feelings. We created a dataset for this task via crowdsourcing, the dialogue dataset annotated with fact-check-needed label (DDFC), and ran classification experiments with several models on it. The best-performing model achieved a classification accuracy of about 88%.
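To make the task setup concrete, the following is a minimal Python sketch of sentence-level labeling and accuracy scoring. It is not the authors' pipeline: the label names, the regex-based sentence splitter, the cue list, and the `toy_classify` rule are all hypothetical stand-ins chosen for illustration, and the models evaluated in the paper are learned classifiers rather than keyword rules.

```python
import re
from typing import List

# Hypothetical label names; the paper's actual label set is not reproduced here.
NEEDS_CHECK = "fact_check_needed"
NO_CHECK = "fact_check_not_needed"

# Hypothetical cue phrases for agreement or personal opinions/feelings.
OPINION_CUES = ("i think", "i love", "i agree", "i feel", "me too")


def split_sentences(response: str) -> List[str]:
    """Naively split a response after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def toy_classify(sentence: str) -> str:
    """Toy stand-in for a trained classifier (assumption, not the paper's model)."""
    lowered = sentence.lower()
    if any(cue in lowered for cue in OPINION_CUES):
        return NO_CHECK
    return NEEDS_CHECK


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of sentences whose predicted label matches the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    response = "I love jazz too! Miles Davis released Kind of Blue in 1959."
    for sentence in split_sentences(response):
        print(toy_classify(sentence), "|", sentence)
```

In this framing, only sentences labeled as needing a fact check would be passed on to a downstream factuality check, while agreement and opinion sentences are left as they are.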

Paper Structure (28 sections, 3 figures, 6 tables)

Introduction
Related Work
Hallucination Detection
Hallucination in Dialogue System
Knowledge-Grounded Dialogue Dataset
DDFC dataset
Idea
Construction of the dataset
Base dataset of DDFC.
Sentence split for label annotation.
Label types.
Sentence label annotation by AMT.
Analysis of the dataset
Validity of dataset annotation.
Number of each label.
...and 13 more sections

Figures (3)

Figure 1: Overview of the study and the collected dataset, DDFC. Existing knowledge-based dialogue responses are divided into sentences. Each sentence is annotated with a label according to its type and used in a classification task.
Figure 2: Flowchart of annotation by Amazon Mechanical Turk to construct DDFC.
Figure 3: Relationship between the amount of training data and accuracy. The accuracy of $\text{Llama 2}_{\text{Chat 7B}}$ improves significantly with more than 800 training examples, suggesting that more data will lead to even higher accuracy. Overall, the accuracy of $\text{DeBERTa v3}_{\text{large}}$ increased more steadily than that of $\text{Llama 2}_{\text{Chat 7B}}$.
