60 Data Points are Sufficient to Fine-Tune LLMs for Question-Answering

Junjie Ye, Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan

TL;DR

This work shows that only about 60 SFT examples are needed to unlock QA capabilities in pretrained LLMs by activating stored knowledge. It introduces a memory-aware, multi-template complementation mechanism to quantify how knowledge is memorized and how SFT data memory levels affect QA across four LLMs. Key findings reveal a diagonal memory effect, model-specific data requirements, and strong performance gains when training data emphasizes higher-memory knowledge, with important implications for data-efficient, model-tailored SFT strategies. The results highlight the importance of data selection and memory alignment in QA fine-tuning and suggest practical paths to more reliable, data-efficient LLM QA systems.

Abstract

Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can then be fine-tuned for the question-answering (QA) task. However, effective strategies for fine-tuning LLMs for the QA task remain largely unexplored. To address this gap, we categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs and conduct a series of empirical analyses. Our experiments, involving four LLMs from three different model families, focus on three key factors: the amount of data required for SFT, the impact of different SFT datasets on model performance, and how data requirements vary across LLMs. The results show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task. Additionally, SFT with data of varying memory levels has a significant impact on LLM performance, with the optimal dataset differing based on the specific model being fine-tuned. Future research will delve deeper into the mechanisms underlying these phenomena.
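
To make the idea of categorizing SFT data by memory level concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a knowledge item's memory score is the fraction of prompt templates for which the pretrained model's response contains the gold answer, and that scores are binned with fixed thresholds. The function names, the substring-matching rule, and the thresholds (0.25, 0.5, 0.75) are all hypothetical choices made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def memory_score(responses, gold):
    """Fraction of responses (one per prompt template) containing the gold answer.

    Assumption: a case-insensitive substring match counts as correct.
    """
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(gold.lower() in r.lower() for r in responses)
    return hits / len(responses)


def memory_level(score, boundaries=(0.25, 0.5, 0.75)):
    """Map a score in [0, 1] to an integer level (hypothetical thresholds)."""
    return bisect_right(boundaries, score)


def group_by_level(items):
    """Group (question, gold, responses) triples by memory level."""
    groups = defaultdict(list)
    for question, gold, responses in items:
        level = memory_level(memory_score(responses, gold))
        groups[level].append(question)
    return dict(groups)


# Toy data: the responses are invented, not produced by any model.
items = [
    ("Capital of France?", "Paris",
     ["Paris", "It is Paris.", "London", "paris, France"]),
    ("Capital of Australia?", "Canberra",
     ["Sydney", "Sydney", "Melbourne", "Sydney"]),
]
groups = group_by_level(items)
```

Under these assumptions, an SFT subset for a given memory level would be drawn from a single group, which is one way to set up the kind of per-level comparison the abstract describes.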

Paper Structure

This paper contains 25 sections, 3 equations, 4 figures, and 29 tables.

Figures (4)

  • Figure 1: An example for the multi-template complementation mechanism.
  • Figure 2: Performance (in-domain) of LLMs trained using different amounts of data. Each line in the plot represents training with data from a specific memory level.
  • Figure 3: Performance (out-of-domain) of LLMs trained using different amounts of data. Each line in the plot represents training with data from a specific memory level.
  • Figure 4: Heat maps showing differences in the distribution of memory levels for different LLMs on the training data $D_{train}$.