CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification

Seungone Kim, Se June Joo, Yul Jang, Hyungjoo Chae, Jinyoung Yeo

TL;DR

CoTEVer addresses the problem of unfaithful chain-of-thought explanations by introducing an annotation toolkit that verifies explanations against retrieved evidence and collects revision data. It combines prompting, evidence retrieval, and annotator verification to produce high-quality, grounded CoT data, enabling downstream CoT fine-tuning and knowledge-intensive task development. The paper also analyzes common explanation errors and outlines practical use cases, including unlikelihood training and fact verification. Public availability of the toolkit suggests potential for broad adoption in improving faithful AI reasoning.

Abstract

Chain-of-thought (CoT) prompting enables large language models (LLMs) to solve complex reasoning tasks by generating an explanation before the final prediction. Despite its promising ability, a critical downside of CoT prompting is that performance is greatly affected by the factuality of the generated explanation. To improve the correctness of the explanations, fine-tuning language models with explanation data is needed. However, there exist only a few datasets that can be used for such approaches, and no data collection tool for building them. Thus, we introduce CoTEVer, a toolkit for annotating the factual correctness of generated explanations and collecting revision data for wrong explanations. Furthermore, we suggest several use cases where the data collected with CoTEVer can be utilized for enhancing the faithfulness of explanations. Our toolkit is publicly available at https://github.com/SeungoneKim/CoTEVer.

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Example of Explanation Verification and Answer Verification of GPT-3's output. Explanation Verification requires additional knowledge, which makes it hard for annotators to intuitively write a revised explanation and answer.
  • Figure 2: The overall illustration of CoTEVer. An annotator asks CoTEVer a question and receives an explanation, supporting evidence documents, and a prediction. The annotator's rating of the explanation (5 for most relevant) and suggestions for a better explanation are then stored in the database, which can be used for research purposes.
  • Figure 3: Snapshot of CoTEVer. The annotator types in a question and receives the output of a large language model (e.g., GPT-3).
  • Figure 4: Snapshot of CoTEVer. The annotator can check the retrieved evidence documents in order to verify each step within the explanation.
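
The annotation flow described in the Figure 2 caption (question, explanation steps, evidence documents, prediction, rating, suggested revision, database) can be pictured as a small record type plus a store. The following is only a minimal sketch under stated assumptions, not the toolkit's actual schema: the class names, field names, the 1 to 5 rating range (the caption only says that 5 is the top score), and the use of SQLite are all hypothetical choices made for illustration.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class StepAnnotation:
    """One explanation step plus the annotator's verdict (hypothetical schema)."""
    step: str                       # a single step of the generated explanation
    evidence: List[str]             # retrieved evidence documents for this step
    rating: int                     # assumed 1-5 scale; the caption only fixes 5 as the top score
    revision: Optional[str] = None  # annotator's suggested rewrite, if any

    def __post_init__(self) -> None:
        if not 1 <= self.rating <= 5:
            raise ValueError("rating must be between 1 and 5")


@dataclass
class AnnotationRecord:
    """A question, the model's prediction, and the annotated explanation steps."""
    question: str
    prediction: str
    steps: List[StepAnnotation] = field(default_factory=list)


def save_record(conn: sqlite3.Connection, record: AnnotationRecord) -> int:
    """Store the record as a JSON payload and return its row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotations "
        "(id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    cur = conn.execute(
        "INSERT INTO annotations (payload) VALUES (?)",
        (json.dumps(asdict(record)),),
    )
    conn.commit()
    return cur.lastrowid


def load_record(conn: sqlite3.Connection, row_id: int) -> dict:
    """Load a stored record back as a plain dictionary."""
    row = conn.execute(
        "SELECT payload FROM annotations WHERE id = ?", (row_id,)
    ).fetchone()
    return json.loads(row[0])
```

Keeping the original step next to the annotator's revision in the same record is one way to retain both the wrong explanation and its correction, which is the kind of paired data the abstract says the toolkit collects.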