An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought
Yuetong Zhao, Hongyu Cao, Xianyu Zhao, Zhijian Ou
TL;DR
The paper addresses the challenge of making generative dialogue models more accurate, coherent, and capable of complex reasoning. It studies RAFT (Retrieval Augmented Fine-Tuning), a method that combines Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and Chain-of-Thought (CoT) to train small-scale models on questions, oracle documents, distractor documents, and CoT-style answers. Across English and Chinese tasks (HotpotQA, PubMedQA, DuReader_robust), RAFT delivers notable gains in both short-form and long-form QA, with CoT providing the largest benefits in noisy retrieval scenarios and in more complex reasoning. The findings highlight RAFT's potential to improve reasoning and information extraction in practical, multilingual dialogue systems without requiring extremely large models.
Abstract
Since the launch of ChatGPT at the end of 2022, generative dialogue models represented by ChatGPT have quickly become essential tools in daily life. As user expectations increase, enhancing the capability of generative dialogue models to solve complex problems has become a focal point of current research. This paper examines the effectiveness of the RAFT (Retrieval Augmented Fine-Tuning) method in improving the performance of generative dialogue models. RAFT combines chain-of-thought (CoT) with supervised fine-tuning (SFT) and retrieval augmented generation (RAG), which significantly enhances the model's information extraction and logical reasoning abilities. We evaluate the RAFT method across multiple datasets and analyse its performance in various reasoning tasks, including long-form and short-form QA tasks, tasks in both Chinese and English, and supportive and comparison reasoning tasks. Notably, this study addresses gaps in previous research regarding long-form QA tasks and Chinese datasets. We also evaluate the benefit of CoT within the RAFT method. This work offers valuable insights for studies focused on enhancing the performance of generative dialogue models.
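To make the training setup more concrete, the following is a minimal sketch of how a single RAFT-style fine-tuning sample might be assembled from a question, oracle documents, distractor documents, and a CoT-style answer. The prompt template, the `RaftSample` class, the `build_raft_sample` function, and the toy data are all illustrative assumptions, not the authors' actual format or code.

```python
import random
from dataclasses import dataclass


@dataclass
class RaftSample:
    """One supervised fine-tuning pair (hypothetical container, not from the paper)."""
    prompt: str
    completion: str


def build_raft_sample(question, oracle_docs, distractor_docs, cot_answer, seed=0):
    """Assemble a prompt/completion pair for RAFT-style fine-tuning.

    Oracle and distractor documents are mixed and shuffled so the model
    cannot rely on document position to find the relevant evidence.
    The prompt layout below is an assumed template.
    """
    docs = list(oracle_docs) + list(distractor_docs)
    random.Random(seed).shuffle(docs)  # seeded for reproducibility
    context = "\n".join(f"[Document {i + 1}] {doc}" for i, doc in enumerate(docs))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    # The target is the chain-of-thought answer: reasoning first, then the final answer.
    return RaftSample(prompt=prompt, completion=cot_answer)


# Toy example with made-up documents.
sample = build_raft_sample(
    question="In which city is the university where Ada studied?",
    oracle_docs=[
        "Ada studied at Example University.",
        "Example University is located in Springfield.",
    ],
    distractor_docs=["Bob studied at Sample College."],
    cot_answer=(
        "Ada studied at Example University, and Example University is in "
        "Springfield. Final answer: Springfield."
    ),
)
print(sample.prompt)
print(sample.completion)
```

Shuffling the documents is a design choice in this sketch: it approximates a noisy retrieval result, in which relevant and irrelevant passages arrive in no particular order.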
