Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen

TL;DR

Mask-DPO is a fine-grained factuality alignment method that embeds sentence-level factuality signals into Direct Preference Optimization (DPO) as masks, so that training learns only from factually correct sentences in preferred responses and does not penalize factually correct sentences in non-preferred responses. Empirically, it substantially improves the factuality of Llama3.1-8B-Instruct on in-domain ANAH-v2 and out-of-domain Biography data, outperforming vanilla DPO, FactTune, and several open-source baselines. The authors also analyze data-scaling strategies and propose a model-specific knowledge-graph hypothesis to explain generalization to unseen topics, supported by proof-of-concept experiments. Overall, the results indicate that sentence-level masking can significantly improve factuality and offer insight into how factuality alignment may reshape the internal knowledge structures of LLMs, with implications for scaling factuality alignment.

Abstract

Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations always come with truthful content in LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduce noise during training. Therefore, this paper proposes a fine-grained factuality alignment method based on Direct Preference Optimization (DPO), called Mask-DPO. Incorporating sentence-level factuality as mask signals, Mask-DPO learns only from factually correct sentences in the preferred samples and avoids penalizing factual content in the non-preferred samples, which resolves the ambiguity in preference learning. Extensive experimental results demonstrate that Mask-DPO can significantly improve the factuality of LLM responses to questions from both in-domain and out-of-domain datasets, even though these questions and their corresponding topics are unseen during training. Trained only on the ANAH train set, Llama3.1-8B-Instruct improves its score on the ANAH test set from 49.19% to 77.53%, even surpassing the score of Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset also improves from 30.29% to 39.39%. We further study the generalization property of Mask-DPO using different training sample scaling strategies and find that scaling the number of topics in the dataset is more effective than scaling the number of questions. Based on this phenomenon, we propose a hypothesis about what factuality alignment does to LLMs and conduct proof-of-concept experiments to verify it. We hope the method and the findings pave the way for future research on scaling factuality alignment.
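
The masking idea described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes the standard DPO objective, -log sigmoid(beta * (log-ratio of preferred minus log-ratio of non-preferred)), and assumes the sentence-level labels are expanded to per-token 0/1 masks applied to the log-ratio sums. The function names, the beta value, and the toy log-probabilities are all hypothetical.

```python
import math

def masked_log_ratio(policy_logps, ref_logps, mask):
    """Sum of (policy - reference) token log-probs over tokens with mask == 1."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))

def neg_log_sigmoid(x):
    """-log(sigmoid(x)) = log(1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def masked_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss where only unmasked tokens contribute (assumed form).

    chosen / rejected: (policy_logps, ref_logps, mask) tuples.
    Assumption: the chosen mask is 1 on tokens of factually correct sentences,
    and the rejected mask is 1 on tokens of hallucinated sentences.
    """
    margin = beta * (masked_log_ratio(*chosen) - masked_log_ratio(*rejected))
    return neg_log_sigmoid(margin)

# Toy example with two tokens per response (values are made up).
chosen = ([-1.0, -2.0], [-1.5, -2.0], [1, 0])    # second token: incorrect sentence, ignored
rejected = ([-1.0, -3.0], [-1.0, -2.0], [0, 1])  # first token: correct sentence, ignored
loss = masked_dpo_loss(chosen, rejected)
```

With all-zero masks the margin is zero and the loss equals log(2), i.e., the pair contributes no preference signal; only the unmasked tokens can move the loss away from that value.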

Paper Structure

This paper contains 24 sections, 8 equations, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Comparison between DPO and Mask-DPO. Vanilla DPO (a) inadvertently encourages all the content in the preferred samples and penalizes all the content in the non-preferred samples, regardless of correctness. In contrast, Mask-DPO (b) incorporates sentence-level factuality into the mask signal, preventing incorrect reward signals and resolving the ambiguity in preference learning.
  • Figure 2: Overview of Mask-DPO. First, we sample K candidate responses for each question from the policy model. Then, we use a fine-grained hallucination annotator to perform sentence-level factuality annotation on each response. We use the proportion of correct sentences out of the total number of sentences as the factuality score. We select the responses with the highest and lowest scores as the preferred and non-preferred samples, respectively. Finally, we perform fine-grained factuality alignment on the policy model using this fine-grained preference data, where the reward signals for incorrect sentences in the preferred samples and for correct sentences in the non-preferred samples are ignored.
  • Figure 3: Case study of the generated responses before and after Mask-DPO. Here, "Baseline" denotes the response from the model before Mask-DPO, i.e., Llama3.1-8B-Instruct, and "Mask-DPO" denotes the response from the model after Mask-DPO. Hallucinated content in the generations is shown in blue.
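
The data-construction steps in the Figure 2 caption (score each sampled response by its proportion of correct sentences, then pick the highest- and lowest-scoring responses) can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the annotator output is assumed to be a list of booleans per response, and the function names, the handling of ties, and the score of an empty response are hypothetical choices.

```python
def factuality_score(sentence_labels):
    """Proportion of sentences labeled correct (True) in one response."""
    if not sentence_labels:
        return 0.0  # assumption: an empty response scores zero
    return sum(sentence_labels) / len(sentence_labels)

def build_preference_pair(candidates):
    """Pick preferred / non-preferred responses from K annotated candidates.

    candidates: list of (response_text, [bool per sentence]) tuples, where the
    booleans stand in for the output of a sentence-level hallucination annotator.
    Returns None when all candidates share the same score (assumed tie rule).
    """
    scored = [(factuality_score(labels), text, labels) for text, labels in candidates]
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    if best[0] == worst[0]:
        return None
    return {
        "chosen": best[1],
        # Keep the signal only for correct sentences in the preferred sample.
        "chosen_mask": [1 if ok else 0 for ok in best[2]],
        "rejected": worst[1],
        # Keep the signal only for incorrect sentences in the non-preferred sample.
        "rejected_mask": [0 if ok else 1 for ok in worst[2]],
    }

# Toy usage with K = 3 candidates (texts and labels are made up).
pair = build_preference_pair([
    ("response A", [True, True, False]),
    ("response B", [True, False, False]),
    ("response C", [True, True, True]),
])
```

The masks here are at sentence level; a training loop would still need to expand them to token positions before applying them to a loss.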