Detecting Response Generation Not Requiring Factual Judgment

Ryohei Kamei, Daiki Shiono, Reina Akama, Jun Suzuki

TL;DR

This study aims to make dialogue responses both attractive and factual. To that end, it defines a task of predicting which sentences do not require a factual correctness judgment, such as expressions of agreement or personal opinions and feelings.

Abstract

With the remarkable development of large language models (LLMs), ensuring the factuality of their output has become a challenge. However, grounding every part of a response in given knowledge or facts is not necessarily desirable in dialogue. This study aims to achieve both attractiveness and factuality in dialogue responses by defining a task of predicting which sentences do not require a factual correctness judgment, such as expressions of agreement or personal opinions and feelings. For this task, we created a dataset via crowdsourcing, the dialogue dataset annotated with fact-check-needed label (DDFC), and ran classification experiments on it with several models. The best-performing model reached a classification accuracy of about 88%.
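As a rough illustration of the task setup described in the abstract, the sketch below splits a response into sentences, attaches a binary label to each one, and scores predictions by accuracy. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `needs_fact_check` field, the regex-based splitter, and the example sentences and labels are all hypothetical and do not come from DDFC.

```python
import re
from dataclasses import dataclass


@dataclass
class LabeledSentence:
    text: str
    # Hypothetical field name; the actual DDFC label schema may differ.
    needs_fact_check: bool


def split_sentences(response: str) -> list[str]:
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def accuracy(gold: list[bool], predicted: list[bool]) -> float:
    """Fraction of sentences whose predicted label matches the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up example response and labels, for illustration only.
response = "I agree! The Eiffel Tower is in Paris. I would love to visit it."
gold_labels = [False, True, False]
sentences = [
    LabeledSentence(text, label)
    for text, label in zip(split_sentences(response), gold_labels)
]

# Hypothetical model output that gets the last sentence wrong.
predicted_labels = [False, True, True]
score = accuracy([s.needs_fact_check for s in sentences], predicted_labels)
```

Here the agreement and the personal wish are marked as not needing a fact check, while the statement about the Eiffel Tower is. With one of three predictions wrong, `score` is 2/3; the roughly 88% figure in the abstract is the same kind of ratio computed over the classification test data.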

Paper Structure (28 sections, 3 figures, 6 tables)


Figures (3)

  • Figure 1: Overview of the study and the collected dataset, DDFC. Existing knowledge-based dialogue responses are divided into sentences. Each sentence is annotated with a label according to its type and used in a classification task.
  • Figure 2: Flowchart of annotation by Amazon Mechanical Turk to construct DDFC.
  • Figure 3: Relationship between the amount of training data and accuracy. The accuracy of $\text{Llama 2}_{\text{Chat 7B}}$ improves significantly once more than 800 training examples are used, suggesting that more data would lead to even higher accuracy. Overall, $\text{DeBERTa v3}_{\text{large}}$ showed a steadier increase in accuracy than $\text{Llama 2}_{\text{Chat 7B}}$.