You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Models

Wenchong He, Liqian Peng, Zhe Jiang, Alex Go

TL;DR

ManyICL substantially outperforms zero/few-shot fine-tuning, approaches the performance of dedicated fine-tuning, and significantly mitigates the catastrophic forgetting observed in zero/few-shot fine-tuning.

Abstract

Large language models (LLMs) possess a remarkable ability to perform in-context learning (ICL), which enables them to handle multiple downstream tasks simultaneously without requiring task-specific fine-tuning. Recent studies have shown that even moderately sized LLMs, such as Mistral 7B, Gemma 7B and Llama-3 8B, can achieve ICL through few-shot in-context fine-tuning of all tasks at once. However, this approach still lags behind dedicated fine-tuning, where a separate model is trained for each individual task. In this paper, we propose a novel approach, Many-Shot In-Context Fine-tuning (ManyICL), which significantly narrows this performance gap by extending the principles of ICL to a many-shot setting. To unlock the full potential of ManyICL and address the inherent inefficiency of processing long sequences with numerous in-context examples, we propose a novel training objective. Instead of solely predicting the final answer, our approach treats every answer within the context as a supervised training target. This effectively shifts the role of many-shot examples from prompts to targets for autoregressive learning. Through extensive experiments on diverse downstream tasks, including classification, summarization, question answering, natural language inference, and math, we demonstrate that ManyICL substantially outperforms zero/few-shot fine-tuning and approaches the performance of dedicated fine-tuning. Furthermore, ManyICL significantly mitigates catastrophic forgetting issues observed in zero/few-shot fine-tuning. The code will be made publicly available upon publication.
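
The training objective described in the abstract, where every in-context answer is a supervised target rather than only the final one, can be illustrated with a small label-construction sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_labels`, the flag `supervise_all_answers`, and the use of `-100` as an ignore label (a common convention in cross-entropy losses) are all hypothetical choices made for this example, and tokens are represented as plain integer lists.

```python
# Assumed convention: positions labeled IGNORE_INDEX contribute nothing to the loss.
IGNORE_INDEX = -100


def build_labels(examples, supervise_all_answers=True):
    """Concatenate (prompt, answer) token lists into one training sequence.

    examples: list of (prompt_token_ids, answer_token_ids) pairs, in context order.
    Returns (input_ids, labels). Labels are aligned with input_ids (unshifted);
    the usual next-token shift is assumed to happen inside the loss.
    """
    input_ids, labels = [], []
    last = len(examples) - 1
    for i, (prompt, answer) in enumerate(examples):
        # Prompt tokens are context only, never targets.
        input_ids.extend(prompt)
        labels.extend([IGNORE_INDEX] * len(prompt))

        input_ids.extend(answer)
        if supervise_all_answers or i == last:
            # Answer tokens are supervised targets.
            labels.extend(answer)
        else:
            # Final-answer-only objective: earlier answers act purely as context.
            labels.extend([IGNORE_INDEX] * len(answer))
    return input_ids, labels


def count_supervised(labels):
    """Number of positions that receive a training signal."""
    return sum(1 for t in labels if t != IGNORE_INDEX)


if __name__ == "__main__":
    shots = [([1, 2], [3]), ([4, 5], [6]), ([7], [8, 9])]
    _, all_labels = build_labels(shots, supervise_all_answers=True)
    _, last_labels = build_labels(shots, supervise_all_answers=False)
    print(count_supervised(all_labels), count_supervised(last_labels))
```

With many shots packed into one long sequence, supervising every answer yields a training signal from each example in a single forward pass, whereas the final-answer-only variant uses the same long sequence to supervise just one answer.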

Paper Structure

This paper contains 29 sections, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Comparison of different fine-tuning strategies. Many-shot fine-tuning for ICL (solid black line) achieves performance comparable to task-level LoRA fine-tuning (dotted red line, five different models). The tasks include Classification (CLS), Multilingual Summarization (SUM), Question Answering (QA), Natural Language Inference (NLI), and Multi-label Classification (ML CLS).
  • Figure 2: Comparison between ManyICFT and task-level fine-tuning workflows. Task-level fine-tuning requires maintaining a separate model for each downstream task, whereas ManyICFT adapts effectively to unseen datasets using many-shot prompting within a single model ("FT" is the fine-tuned model, "Base" is the base LLM model).
  • Figure 3: Comparison of attention mechanisms between mask-last-target and mask-all-targets. The blue represents the input prompts and green represents the target outputs. The red square in the inference grid denotes the in-context examples.
  • Figure 4: Illustration of various fine-tuning approaches for ICL models. The figure shows zero-shot, few-shot, and many-shot fine-tuning methods. For both few-shot ($n=2$ here) and many-shot ICL, either the mask-last-target or the mask-all-targets strategy can be applied.
  • Figure 5: Scaling test on the number of shots. Many-shot fine-tuning achieves superior performance in both few-shot and many-shot conditions. For CLS, when the number of shots is around 1.5K, many-shot fine-tuning achieves performance comparable to task-level fine-tuning (red line). Similarly, for the other tasks, many-shot fine-tuning reduces the gap between ICL and task-level fine-tuning.
  • ...and 1 more figures