GRATH: Gradual Self-Truthifying for Large Language Models

Weixin Chen; Dawn Song; Bo Li

TL;DR

GRATH targets the persistent truthfulness problem in LLMs: it uses out-of-domain (OOD) question prompts to generate pairwise truthfulness data and optimizes the model with Direct Preference Optimization (DPO) in a gradual, self-supervised loop. The method alternates between refining the data and updating the model, and reaches state-of-the-art accuracy on TruthfulQA MC1/MC2 with 7B models while preserving performance on established benchmarks such as ARC, HellaSwag, and MMLU. The analysis shows that both the domain gap between training and test data and the distributional distance within the answer pairs affect how well truthfulness is learned, and that GRATH is more robust to domain shift than plain DPO. Overall, GRATH is an efficient post-processing route to improving the truthfulness of LLMs without human-annotated answers for the OOD prompts.

Abstract

Truthfulness is paramount for large language models (LLMs) as they are increasingly deployed in real-world applications. However, existing LLMs still struggle to generate truthful content, as evidenced by their modest performance on benchmarks like TruthfulQA. To address this issue, we propose GRAdual self-truTHifying (GRATH), a novel post-processing method to enhance the truthfulness of LLMs. GRATH uses out-of-domain question prompts to generate pairwise truthfulness training data, where each pair contains a question together with a correct and an incorrect answer, and then optimizes the model via direct preference optimization (DPO) to learn from the truthfulness difference between the two answers. GRATH iteratively refines the truthfulness data and updates the model, leading to a gradual improvement in model truthfulness in a self-supervised manner. Empirically, we evaluate GRATH using different 7B LLMs and compare against LLMs of similar or even larger size on benchmark datasets. Our results show that GRATH effectively improves LLMs' truthfulness without compromising other core capabilities. Notably, GRATH achieves state-of-the-art performance on TruthfulQA, with an MC1 accuracy of 54.71% and an MC2 accuracy of 69.10%, surpassing even those of 70B LLMs.
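
To make the "learn from the truthfulness difference between answer pairs" step concrete, the following is a minimal sketch of the standard DPO loss for a single (question, correct answer, incorrect answer) triple. It is not the authors' implementation: the function name `dpo_pair_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the inputs are assumed to be summed token log-probabilities of each answer under the model being trained (the policy) and under a frozen reference copy of the pretrained model.

```python
import math


def dpo_pair_loss(policy_logp_correct, policy_logp_incorrect,
                  ref_logp_correct, ref_logp_incorrect, beta=0.1):
    """Standard DPO loss for one answer pair (illustrative sketch).

    Each argument is the log-probability of a whole answer given the
    question. `beta` is a hypothetical default, not a value from the paper.
    """
    # How much more the policy favors each answer than the reference does.
    correct_gain = policy_logp_correct - ref_logp_correct
    incorrect_gain = policy_logp_incorrect - ref_logp_incorrect
    margin = beta * (correct_gain - incorrect_gain)

    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The policy equals the reference: the loss is log(2).
baseline = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)

# The policy has shifted probability toward the correct answer: lower loss.
improved = dpo_pair_loss(-3.0, -7.0, -5.0, -5.0)
```

Minimizing this quantity over all self-generated pairs pushes the model to raise the likelihood of the correct answer relative to the incorrect one, measured against the reference model. In a real training run the log-probabilities would come from forward passes of the two models rather than being passed in as constants.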

Paper Structure (36 sections, 1 equation, 13 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Method
Creating Pairwise Truthfulness Data
Questions.
Prompt.
Few-shot Demonstrations.
Self-Truthifying
Gradual Self-Truthifying
Step 1: Refining Data.
Step 2: Updating Model.
Experiments
Experimental Setup
Models.
Baseline Methods.
...and 21 more sections

Figures (13)

Figure 1: Accuracy of pretrained models, DPO, and GRATH on TruthfulQA's MC1 and MC2 tasks. We evaluate DPO and GRATH on two pretrained models, Llama2-Chat-7B and Zephyr, based on an OOD training dataset, ARC-Challenge. GRATH effectively improves MC1 and MC2 accuracy of Llama2-Chat-7B (Zephyr) by 24.5% (11.6%) and 23.8% (8.9%). DPO enhances MC1 and MC2 accuracy of Llama2-Chat-7B by 6.5% and 6.8% while decreasing those on Zephyr by 4.2% and 2.6%. The performance of DPO compared to GRATH indicates its vulnerability to OOD data.
Figure 2: Framework of GRATH, which consists of three components. Given a pretrained base model, GRATH (a) creates pairwise truthfulness training data via few-shot prompting. An illustrative example is on the left. A pair of truthfulness data includes a question, a correct answer, and an incorrect answer. (b) Fine-tunes the pretrained model via DPO on the pairwise truthfulness training data. The model learns from the truthfulness difference in the self-generated answer pairs and thereby enhances its own truthfulness (i.e., it self-truthifies). (c) Iteratively generates data and optimizes the model for $T$ iterations, thus gradually boosting model truthfulness in a self-supervised manner.
Figure 3: MC1 and MC2 accuracy of DPO (left) and SFT (right) with varying degrees of transformation applied to the answers in the pairwise truthfulness training data. A larger top-$p$ indicates a larger domain gap between the truthfulness training data and the testing data. The downward trends in each panel indicate that a model trained with either DPO or SFT becomes less truthful as the domain gap in answers increases. DPO performs better than SFT overall.
Figure 4: MC1 and MC2 accuracy of DPO (left) and SFT (right). DPO and SFT are applied with truthfulness training data created using a variety of strategies. $Q_{OOD}$ ($Q_{IND}$) represents using OOD (in-domain) questions from ARC-C (TruthfulQA); $A_{OOD}$ indicates using annotated answers from ARC-C; $G(\cdot)$ indicates using answers generated by LLMs. Specifically, $FS_{OOD}$ ($FS_{IND}$) corresponds to OOD (in-domain) few-shot demonstrations, and $GT$ implies merging ground-truth answers into the prompts. We find that i) $DPO^{Q_{OOD}}_{G(FS_{IND})}$ performs the best among all $DPO^{Q_{OOD}}$ variants, indicating that in-domain demonstrations yield answers that are closer to the testing domain, leading to a more truthful model. ii) Arrows show the performance shift when the questions change from OOD to in-domain. The trends toward the upper right indicate that the model becomes more truthful as the domain gap in questions decreases. The same findings hold for SFT. iii) DPO outperforms SFT since it improves the pretrained model's truthfulness in general.
Figure 5: Left: Distributions of pairwise distance in truthfulness data used in $DPO^{1,2}$. Right: Performance of the pretrained model fine-tuned with truthfulness data used in $DPO^{1,2}$ on TruthfulQA. The improved performance indicates that a larger distributional distance within truthfulness data leads to a more truthful model.
...and 8 more figures
