Training With "Paraphrasing the Original Text" Teaches LLM to Better Retrieve in Long-context Tasks
Yijiong Yu, Yongfeng Huang, Zhixiao Qi, Zhe Zhou
TL;DR
This work tackles the challenge of long-context understanding in LLMs, specifically the 'lost in the middle' retrieval problem where key information in the middle of long inputs is not effectively used. It introduces a retrieval-focused fine-tuning approach by embedding an 'original text paraphrasing' component at the start of answers, thereby explicitly training the retrieval step and activating latent retrieval abilities. A large bilingual long-context dataset (up to 32k context) is generated with GPT-4, comprising 2k short-form and 3k long-context samples across English and Chinese and designed for multi-document QA. Fine-tuning Qwen and Llama models with QLoRA on int4 weights yields improved long-context performance on LongBench and NaturalQuestions Multi-doc-QA, with modest degradation on some general-ability benchmarks, suggesting a practical, scalable path to stronger retrieval in long-context tasks and mitigated 'lost in the middle' effects.
Abstract
As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs. Despite this advancement, most of them still struggle to handle long-context tasks accurately, often exhibiting the "lost in the middle" issue. We identify insufficient retrieval capability as one of the important reasons for this issue. To tackle this challenge, we propose a novel approach to designing training data for long-context tasks, aimed at improving LLMs' proficiency in extracting key information from long contexts. Specifically, we incorporate an additional part named "paraphrasing the original text" when constructing the answers of training samples, and then fine-tune the model on these samples. In experiments on LongBench and the NaturalQuestions Multi-document-QA dataset with models from the Llama and Qwen series, our method achieves improvements of up to 8.48% and 4.48% in average scores, respectively, showing its effectiveness in improving model performance on long-context tasks.
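To make the data-construction idea concrete, here is a minimal Python sketch of how one multi-document QA training sample might be assembled so that its answer begins with a paraphrase of the relevant source text. It is an illustration only, not the authors' pipeline: the names `TrainingSample` and `build_sample`, the "Document i" labeling, and the answer template are all assumptions, and the paraphrase itself is taken as an input (the TL;DR says the dataset was generated with GPT-4).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    """One fine-tuning example (hypothetical container, not from the paper)."""
    prompt: str
    answer: str


def build_sample(
    documents: List[str],
    question: str,
    source_index: int,
    paraphrase: str,
    final_answer: str,
) -> TrainingSample:
    """Build a sample whose answer restates the evidence before answering.

    source_index is the 1-based position of the document that holds the
    evidence; paraphrase is a restatement of that evidence supplied by the
    caller. The wording of the template below is an assumption.
    """
    if not 1 <= source_index <= len(documents):
        raise ValueError("source_index must point at one of the documents")

    # Label each document so the answer can refer back to its source.
    context = "\n\n".join(
        f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1)
    )
    prompt = f"{context}\n\nQuestion: {question}"

    # The retrieval step comes first: name the source and paraphrase it,
    # then give the final answer.
    answer = (
        f"According to Document {source_index}, {paraphrase} "
        f"Therefore, the answer is: {final_answer}"
    )
    return TrainingSample(prompt=prompt, answer=answer)


sample = build_sample(
    documents=["Paris is the capital of France.", "Berlin is in Germany."],
    question="What is the capital of France?",
    source_index=1,
    paraphrase="the text states that France's capital is Paris.",
    final_answer="Paris",
)
```

Placing the paraphrase ahead of the final answer means the training loss also covers the step of locating and restating the evidence, which is the retrieval behavior the abstract aims to strengthen.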
