
An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought

Yuetong Zhao, Hongyu Cao, Xianyu Zhao, Zhijian Ou

TL;DR

The paper addresses the challenge of making generative dialogue models more accurate, coherent, and capable of complex reasoning. It studies RAFT (Retrieval Augmented Fine-Tuning), a method that combines Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and Chain-of-Thought (CoT) to train small-scale models on questions, oracle documents, distractor documents, and CoT-style answers. Across English and Chinese tasks (HotpotQA, PubMedQA, DuReader_robust), RAFT delivers notable gains in short- and long-form QA, with CoT providing the largest benefits in noisy retrieval scenarios and in more complex reasoning. The findings highlight RAFT’s potential to improve reasoning and information extraction in practical, multilingual dialogue systems without requiring extremely large models.
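To make the training-data composition described above concrete, here is a minimal sketch of how one fine-tuning example might be assembled from a question, its oracle documents, sampled distractors, and a CoT-style answer. It is an illustration under stated assumptions, not the authors' pipeline: the names `RaftExample` and `build_raft_example`, the prompt layout, and the default of two distractors are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RaftExample:
    """One supervised fine-tuning pair (hypothetical container)."""
    prompt: str   # documents followed by the question
    target: str   # chain-of-thought style answer the model is trained to produce


def build_raft_example(
    question: str,
    oracle_docs: List[str],
    distractor_pool: List[str],
    cot_answer: str,
    num_distractors: int = 2,           # assumed value, not taken from the paper
    rng: Optional[random.Random] = None,
) -> RaftExample:
    """Mix oracle and distractor documents into one context and pair it with the CoT answer."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible

    # Sample distractors, then shuffle so the oracle position carries no signal.
    k = min(num_distractors, len(distractor_pool))
    context = list(oracle_docs) + rng.sample(distractor_pool, k)
    rng.shuffle(context)

    # Assumed prompt layout: numbered documents, then the question.
    doc_block = "\n".join(f"[Document {i + 1}] {doc}" for i, doc in enumerate(context))
    prompt = f"{doc_block}\n\nQuestion: {question}\nAnswer:"
    return RaftExample(prompt=prompt, target=cot_answer)
```

In this sketch the model sees relevant and irrelevant documents together and is supervised on an answer that reasons before concluding, which is the setting in which the summary reports CoT helping most.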

Abstract

Since the launch of ChatGPT at the end of 2022, generative dialogue models represented by ChatGPT have quickly become essential tools in daily life. As user expectations increase, enhancing the capability of generative dialogue models to solve complex problems has become a focal point of current research. This paper examines the effectiveness of the RAFT (Retrieval Augmented Fine-Tuning) method in improving the performance of generative dialogue models. RAFT combines chain-of-thought with supervised fine-tuning (SFT) and retrieval augmented generation (RAG), which significantly enhances the model's information extraction and logical reasoning abilities. We evaluate the RAFT method across multiple datasets and analyse its performance in various reasoning tasks, including long-form and short-form QA tasks, tasks in both Chinese and English, and supportive and comparison reasoning tasks. Notably, this work addresses gaps in previous research regarding long-form QA tasks and Chinese datasets. We also evaluate the benefit of chain-of-thought (CoT) in the RAFT method. This work offers valuable insights for studies focused on enhancing the performance of generative dialogue models.

Paper Structure

This paper contains 14 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of RAFT
  • Figure 2: Construction of RAFT fine-tuning dataset
  • Figure 3: Examples of Chinese chain-of-thought style response generation process via GPT-3.5 in RAFT.
  • Figure 4: Examples of English chain-of-thought style response generation process via GPT-3.5 in RAFT.