From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Zheyang Xiong; Vasilis Papageorgiou; Kangwook Lee; Dimitris Papailiopoulos

TL;DR

Finetuning LLMs on the proposed synthetic retrieval data leaves their performance on general benchmarks almost unchanged, whereas finetuning on other baseline long-context augmentation data can encourage hallucination.

Abstract

Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves their information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., a $10.5\%$ improvement on $20$-document MDQA at position $10$ for GPT-3.5 Turbo). We also find that the performance of LLMs finetuned on our data remains almost constant on general benchmarks, while finetuning on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data shows no performance drop, while other baseline data can cause a drop ranging from $2.33\%$ to $6.19\%$). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
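To make the task family concrete, the following is a minimal sketch of how one simple dictionary key-value retrieval example might be generated. The function name `make_simple_kv_task`, the prompt wording, the answer format, and the default sizes are all assumptions for illustration; they are not taken from the paper.

```python
import random


def make_simple_kv_task(num_dicts=5, keys_per_dict=3, max_int=1000, seed=0):
    """Build one hypothetical simple key-value retrieval example.

    Returns (prompt, answer, gold_key, gold_value). The prompt and answer
    wording are illustrative assumptions, not the paper's exact templates.
    """
    rng = random.Random(seed)

    # Draw distinct integer keys so the gold key occurs in exactly one dictionary.
    keys = rng.sample(range(max_int), num_dicts * keys_per_dict)
    dicts = []
    for i in range(num_dicts):
        chunk = keys[i * keys_per_dict:(i + 1) * keys_per_dict]
        dicts.append({k: rng.randrange(max_int) for k in chunk})

    # Pick the dictionary and the key the model is asked to retrieve.
    gold_index = rng.randrange(num_dicts)
    gold_key = rng.choice(list(dicts[gold_index]))
    gold_value = dicts[gold_index][gold_key]

    lines = [f"Dictionary [{i + 1}] {d}" for i, d in enumerate(dicts)]
    prompt = (
        "Below is a list of dictionaries with integer keys and values.\n"
        + "\n".join(lines)
        + f"\nReport the value of key {gold_key} and the dictionary it is in."
    )
    answer = (
        f"The key {gold_key} is in dictionary [{gold_index + 1}] "
        f"with value {gold_value}."
    )
    return prompt, answer, gold_key, gold_value


if __name__ == "__main__":
    prompt, answer, _, _ = make_simple_kv_task(seed=1)
    print(prompt)
    print(answer)
```

Seeding the generator keeps each example reproducible, and sampling keys without replacement avoids ambiguous targets in this simple variant.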

Paper Structure (27 sections, 11 figures, 2 tables)


Introduction
Related work
Long Context LLMs.
Data-centric AI.
LLM Benchmarks and Evals.
Synthetic dataset of retrieval tasks
Simple dictionary key-value retrieval.
Multi-subkey dictionary key-value retrieval.
Prompt with an answer template.
Experiments and results
Stage 1: Finetuning LLMs on synthetic retrieval tasks
Stage 2: Evaluations on long context retrieval and reasoning tasks
Multi-document question answering (MDQA)
Finding 1:
Finding 2:
...and 12 more sections

Figures (11)

Figure 1: An example prompt with desired answer of simple dictionary key-value retrieval task.
Figure 2: An example prompt with desired answer of multi-subkey dictionary key-value retrieval task. Here (141, 623, 616) is the gold key. Note that 141 and 623 in the gold key are also subkeys of other keys.
Figure 3: The prompt of the simple dictionary key-value retrieval task is provided with an answer template.
Figure 4: Token-level loss on the target answer when provided with (right) and without (left) an answer template, where red indicates high and green low loss.
Figure 5: Performance of GPT-3.5 Turbo, Mistral 7B and their corresponding finetuned versions on the MDQA task.
...and 6 more figures
