FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering

Tianchi Cai; Zhiwen Tan; Xierui Song; Tao Sun; Jiyan Jiang; Yunqi Xu; Yinger Zhang; Jinjie Gu

TL;DR

FoRAG tackles factuality and logical structure in web-enhanced LFQA by introducing an outline-enhanced generator to enforce clear organization and a doubly fine-grained RLHF framework to optimize factuality at multiple evaluation and reward granularities. The approach yields state-of-the-art results on English and Chinese benchmarks, with FoRAG-L-7B surpassing WebGPT-175B while using far fewer parameters. The work also delivers two large outline-enhanced bilingual datasets and analyzes training efficiency, showing that the improvements come with reasonable computational overhead. Overall, FoRAG significantly improves coherence, helpfulness, and factuality in long-form web-guided QA and provides publicly available resources to facilitate reproducibility and further research.

Abstract

Retrieval Augmented Generation (RAG) has become prevalent in question-answering (QA) tasks due to its ability to use search engines to enhance the quality of long-form question-answering (LFQA). Despite the emergence of various open-source methods and web-enhanced commercial systems such as Bing Chat, two critical problems remain unsolved: the lack of factuality and the lack of clear logic in the generated long-form answers. In this paper, we remedy these issues via a systematic study of answer generation in web-enhanced LFQA. Specifically, we first propose a novel outline-enhanced generator to achieve clear logic in the generation of multifaceted answers, and we construct two datasets accordingly. We then propose a factuality optimization method based on a carefully designed doubly fine-grained RLHF framework, which contains automatic evaluation and reward modeling at different levels of granularity. Our generic framework includes conventional fine-grained RLHF methods as special cases. Extensive experiments verify the superiority of our proposed Factuality-optimized RAG (FoRAG) method on both English and Chinese benchmarks. In particular, when applying our method to Llama2-7B-chat, the derived model FoRAG-L-7B outperforms WebGPT-175B on three commonly used metrics (coherence, helpfulness, and factuality), while having far fewer parameters (only 1/24 of that of WebGPT-175B). Our datasets and models are made publicly available for better reproducibility: https://huggingface.co/forag.

Paper Structure (21 sections, 4 equations, 2 figures, 8 tables)

Introduction
Related Work
Preliminary
Outline-Enhanced RAG
Outline-Enhanced Generator
Outline-Enhanced Long-Form QA Dataset
Factuality-Optimized RAG
Difficulties of Directly Applying RLHF
Doubly Fine-grained RLHF
Experiment
Experimental setup
Main results
Comparison of Various Factuality Optimization Granularities
Ablation study
Evaluation of Training Efficiency
...and 6 more sections

Figures (2)

Figure 1: Illustrations of the input to the LLM in the web-enhanced LFQA task (upper left), the existing generator (lower left), our outline-enhanced generator (middle), and our doubly fine-grained factuality optimization method (right). Before generating a long answer, the outline-enhanced generator first drafts an organization pattern and an outline to promote clear logic in the generation. The doubly fine-grained RLHF optimizes factuality by incorporating fine-grained designs in two core steps, i.e., factuality evaluation and reward modeling, with methods at multiple levels of granularity proposed for each step.
Figure 2: Evaluation results in terms of various metrics of different models fine-tuned from Llama2-7B. We vary the ratio of the Chinese samples to the English samples in the training dataset.
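
The outline-then-answer flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names, the use of two separate model calls, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical.

```python
from typing import Callable, List


def format_passages(passages: List[str]) -> str:
    """Number the retrieved passages so the prompt can refer to them."""
    return "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))


def build_outline_prompt(question: str, passages: List[str]) -> str:
    """Stage 1 prompt (hypothetical wording): request a pattern and an outline."""
    return (
        f"Question: {question}\n"
        f"Retrieved passages:\n{format_passages(passages)}\n"
        "Choose an organization pattern for the answer, then draft an outline."
    )


def build_answer_prompt(question: str, passages: List[str], outline: str) -> str:
    """Stage 2 prompt (hypothetical wording): expand the outline into an answer."""
    return (
        f"Question: {question}\n"
        f"Retrieved passages:\n{format_passages(passages)}\n"
        f"Outline:\n{outline}\n"
        "Write a long-form answer that follows the outline."
    )


def outline_enhanced_answer(
    question: str, passages: List[str], llm: Callable[[str], str]
) -> str:
    """Draft an outline first, then generate the answer conditioned on it."""
    # Assumption: two separate calls; a single call emitting both parts
    # would be an equally plausible reading of the caption.
    outline = llm(build_outline_prompt(question, passages))
    return llm(build_answer_prompt(question, passages, outline))
```

Passing the model in as a plain callable keeps the sketch independent of any particular inference library; a stub function is enough to exercise the control flow.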
