Training With "Paraphrasing the Original Text" Teaches LLM to Better Retrieve in Long-context Tasks

Yijiong Yu; Yongfeng Huang; Zhixiao Qi; Zhe Zhou

TL;DR

This work tackles the challenge of long-context understanding in LLMs, specifically the 'lost in the middle' retrieval problem where key information in the middle of long inputs is not effectively used. It introduces a retrieval-focused fine-tuning approach by embedding an 'original text paraphrasing' component at the start of answers, thereby explicitly training the retrieval step and activating latent retrieval abilities. A large bilingual long-context dataset (up to 32k context) is generated with GPT-4, comprising 2k short-form and 3k long-context samples across English and Chinese and designed for multi-document QA. Fine-tuning Qwen and Llama models with QLoRA on int4 weights yields improved long-context performance on LongBench and NaturalQuestions Multi-doc-QA, with modest degradation on some general-ability benchmarks, suggesting a practical, scalable path to stronger retrieval in long-context tasks and mitigated 'lost in the middle' effects.

Abstract

As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs. Despite this advancement, most of them still struggle to handle long-context tasks accurately, often exhibiting the "lost in the middle" issue. We identify insufficient retrieval capability as one of the important reasons for this issue. To tackle this challenge, we propose a novel approach to designing training data for long-context tasks, aimed at improving LLMs' proficiency in extracting key information from long contexts. Specifically, we incorporate an additional part named "paraphrasing the original text" when constructing the answers of training samples and then fine-tune the model on them. In experiments on the LongBench and NaturalQuestions Multi-document-QA datasets with models from the Llama and Qwen series, our method achieves improvements of up to 8.48% and 4.48% in average scores, respectively, showing its effectiveness in improving model performance on long-context tasks.
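To make the idea of the answer design concrete, the following is a minimal sketch of how a multi-document QA training target might be assembled so that a paraphrase of the relevant source text comes before the final answer. The `QASample` structure, the function names, and the exact wording of the template are assumptions made for illustration; the summary above only states that the paraphrasing part is placed at the start of the answer and that the dataset was generated with GPT-4, so the paraphrase is passed in here as a plain string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QASample:
    """One multi-document QA example (hypothetical structure for illustration)."""
    documents: List[str]
    question: str
    gold_index: int      # 0-based index of the document that contains the answer
    short_answer: str


def build_prompt(sample: QASample) -> str:
    """Concatenate all documents, then append the question."""
    docs = "\n\n".join(
        f"Document {i + 1}: {doc}" for i, doc in enumerate(sample.documents)
    )
    return f"{docs}\n\nQuestion: {sample.question}"


def build_target(sample: QASample, paraphrase: str) -> str:
    """Build the training answer: paraphrase of the source text first, answer second.

    The template wording is an assumption, not the paper's exact format.
    """
    return (
        f"According to Document {sample.gold_index + 1}: {paraphrase}\n"
        f"Therefore, the answer is: {sample.short_answer}"
    )


sample = QASample(
    documents=[
        "Paris is in France.",
        "The Nile flows through Egypt.",
        "Tokyo is in Japan.",
    ],
    question="Which country does the Nile flow through?",
    gold_index=1,
    short_answer="Egypt",
)
prompt = build_prompt(sample)
target = build_target(sample, "The Nile runs through Egypt.")
```

Placing the paraphrase first means the model is trained to locate and restate the supporting passage before committing to an answer, which is the retrieval step the method aims to strengthen.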

Paper Structure (26 sections, 6 figures, 7 tables)

This paper contains 26 sections, 6 figures, 7 tables.

Introduction
Related Work
Input Context and Prompt
Training Data
Position Embedding
Attention Weights
Method
Models Often Fail to Fully Utilize Retrieval Ability
Original Text Paraphrasing
Dataset Construction
Experiments
Implementation Details
Evaluation
Evaluation on Long-context Tasks
Evaluation on "Lost in the middle" Issue
...and 11 more sections

Figures (6)

Figure 1: Our method adds "paraphrasing the original text" to the training samples.
Figure 2: An example of different answer design methods for a multi-doc-QA sample. In the context and answers, key information for answering the question is highlighted.
Figure 3: The pipeline of constructing our datasets with multi-doc-QA samples.
Figure 4: Qwen1.5-4b-Chat passes the "Needle in a Haystack" test nearly perfectly. The x-axis represents the length of the context, and the y-axis represents the position of the "needle" in the context.
Figure 5: The accuracy of Qwen1.5-4b-Chat on the multi-doc-QA task drops rapidly as the number of documents in the context grows, with the gold document always placed in the middle of the context.
...and 1 more figure
