ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Peng Xu; Wei Ping; Xianchao Wu; Chejian Xu; Zihan Liu; Mohammad Shoeybi; Bryan Catanzaro

TL;DR

ChatQA 2 presents a 128K-context, open-weight Llama 3.0–based model that closes the gap with GPT-4 Turbo in long-context understanding and retrieval-augmented generation. The authors extend the base model via continual pretraining and a three-stage instruction-tuning pipeline to boost long-context reasoning and RAG performance, and demonstrate strong results on ultra-long and RAG benchmarks while releasing data and scripts publicly. They also analyze the comparative strengths of long-context versus RAG, showing that increasing the number of retrieved chunks can make RAG outperform full long-context solutions in some settings. Overall, the work provides a practical, reproducible path to high-capacity open LLMs and highlights the complementary roles of long-context processing and retrieval-based augmentation for real-world workloads.

Abstract

In this work, we introduce ChatQA 2, a Llama 3.0-based model with a 128K context window, designed to bridge the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo-2024-04-09) in long-context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are complementary and essential for LLMs to process large volumes of information that cannot fit into a single prompt. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model outperforms most existing state-of-the-art models, including GPT-4-Turbo-2024-04-09, Qwen2-72B-Instruct, and Llama3.1-70B-Instruct, on ultra-long tasks beyond 100K tokens, as well as on the RAG benchmark using only a 4K context window, showing strong long-context capability across varying sequence lengths. We further provide extensive comparisons between direct long-context and RAG solutions using the same state-of-the-art long-context LLMs. Interestingly, we find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks. With a large set of top-k chunks, RAG consistently outperforms the direct long-context solution using the same state-of-the-art long-context models (e.g., Llama3-ChatQA-2-70B and Qwen2-72B-Instruct) on both 32K and 128K benchmarks. We open-source the model weights, training data, and the evaluation setup for the community: https://chatqa2-project.github.io/

Paper Structure (25 sections, 4 figures, 10 tables)

Introduction
Related Work
Long Context LLM
Retrieval-augmented Generation (RAG)
Method
Extending Context Window to 128K
Instruction-Tuning with Long Context Data
Long Context Retriever meets Long Context LLM
Baselines and Evaluation Benchmarks
Long context models
Retrieval-augmented generation (RAG)
Ultra-long Context Benchmarks Beyond 100K Tokens
Long Context Benchmarks within 32K Tokens
Short Context within 4K Tokens
Results
...and 10 more sections

Figures (4)

Figure 1: Needle In A Haystack test for (a) Llama3.1-Instruct-8B, (b) Llama3.1-Instruct-70B, (c) Llama3-ChatQA-2-8B, and (d) Llama3-ChatQA-2-70B, up to 128K context window. We show the result using the same needle: "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."
Figure 2: Ablation of Llama3-ChatQA-2-70B with RAG given different top-k = {5, 10, 20, 40} retrieval, and chunk-size = {300, 600, 1200} on long context benchmarks within 32K tokens (see the section "Long Context Benchmarks within 32K Tokens" for more details). The accuracy can be monotonically improved with more retrieved tokens (i.e., k × chunk_size) in the context window.
Figure 3: Needle In A Haystack test for Llama3.1-Instruct-8B and Llama3.1-Instruct-70B up to 128K context window. We show the Passkey retrieval results here. The needle is set as "The pass key is 385243. Remember it. 385243 is the pass key." with the question asking "What is the pass key?"
Figure 4: The Needle In A Haystack (NIAH) test shows that using “<s>” as the document breaker works much better than using <EOS> <BOS>.
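
The Figure 2 caption measures the retrieval budget as k × chunk_size retrieved tokens. As a rough illustration of that arithmetic only, the sketch below enumerates the ablation grid from the caption and lists the token budget of each configuration. The function and variable names are hypothetical, and the code does not reproduce the paper's retriever or its accuracy numbers.

```python
from itertools import product

# Grid values taken from the Figure 2 caption; the names are illustrative.
TOP_K = (5, 10, 20, 40)
CHUNK_SIZES = (300, 600, 1200)


def retrieved_token_budgets(top_ks=TOP_K, chunk_sizes=CHUNK_SIZES):
    """Map each (k, chunk_size) pair to its budget of k * chunk_size tokens."""
    return {(k, c): k * c for k, c in product(top_ks, chunk_sizes)}


budgets = retrieved_token_budgets()

# Print the configurations ordered by how many retrieved tokens they place
# in the context window.
for (k, c), tokens in sorted(budgets.items(), key=lambda item: item[1]):
    print(f"top-k={k:>2}  chunk_size={c:>4}  retrieved_tokens={tokens}")
```

Several configurations share the same budget (for example, top-k = 10 with chunk-size = 600 and top-k = 20 with chunk-size = 300 both retrieve 6,000 tokens), which is why the caption describes the trend in terms of total retrieved tokens rather than k or chunk size alone.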
