Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning

Qingyu Yin, Xuzheng He, Luoao Deng, Chak Tou Leong, Fan Wang, Yanzhao Yan, Xiaoyu Shen, Qiang Zhang

TL;DR

For tasks with implicit patterns, ICL captures these patterns significantly better than fine-tuning. The paper proposes circuit shift theory, grounded in mechanistic interpretability, to explain why ICL wins.

Abstract

Fine-tuning and in-context learning (ICL) are two prevalent methods for imbuing large language models with task-specific knowledge. It is commonly believed that fine-tuning can surpass ICL given sufficient training samples, as it allows the model to adjust its internal parameters based on the data. However, this paper presents a counterintuitive finding: for tasks with implicit patterns, ICL captures these patterns significantly better than fine-tuning. We developed several datasets featuring implicit patterns, such as sequences whose answers are determined by parity or calculations that contain reducible terms. We then evaluated the models' understanding of these patterns under both fine-tuning and ICL across models ranging from 0.5B to 7B parameters. The results indicate that models employing ICL can quickly grasp deep patterns and significantly improve accuracy. In contrast, fine-tuning, despite using thousands of times more training samples than ICL, achieved only limited improvements. We also propose circuit shift theory, from a mechanistic interpretability perspective, to explain why ICL wins.
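To make the notion of an implicit pattern concrete, the following is a minimal sketch of a toy task in the spirit of the "reducible terms" example mentioned in the abstract. It is not the authors' dataset: the expression template, the function names (`make_example`, `solve_by_shortcut`, `build_icl_prompt`), and the prompt format are all assumptions made for illustration. Each expression contains a term that always cancels to zero, so the answer can be computed formally or read off directly once the pattern is noticed.

```python
import random


def make_example(rng):
    """Return (expression, answer) for a hypothetical implicit-pattern task.

    The factor (b - b) is always zero, so the product vanishes and the
    answer equals the first operand. This template is an illustrative
    assumption, not the paper's actual data format.
    """
    a = rng.randint(1, 99)
    b = rng.randint(1, 99)
    c = rng.randint(1, 99)
    expr = f"{a} + {c} * ({b} - {b})"
    answer = a + c * (b - b)  # formal computation
    return expr, answer


def solve_by_shortcut(expr):
    """Exploit the implicit pattern: the answer is the first operand."""
    return int(expr.split(" + ")[0])


def build_icl_prompt(demos, query):
    """Place solved demonstrations in context, followed by an open query."""
    lines = [f"{expr} = {ans}" for expr, ans in demos]
    lines.append(f"{query} =")
    return "\n".join(lines)


rng = random.Random(0)
demos = [make_example(rng) for _ in range(4)]
prompt = build_icl_prompt(demos, "7 + 3 * (5 - 5)")
print(prompt)
```

In an ICL setting, a prompt like this would be sent to the model as-is; in a fine-tuning setting, each `(expression, answer)` pair would instead serve as a separate training example.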

Paper Structure (45 sections, 6 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: (a) A simple example of an implicit pattern detection task. The given problem (an arithmetic expression calculation task in this figure) can be solved either formally, e.g., by direct calculation, or by exploiting the detected implicit pattern as a shortcut. (b) Illustration of implicit pattern detection for in-context learning and fine-tuning. For ICL, several examples with answers are given in context, and a new question is then used to test accuracy. For fine-tuning, the LLM learns from individual examples using parameter-update methods such as full-parameter fine-tuning or PEFT.
  • Figure 2: Examples of implicit pattern detection for four reasoning tasks. The implicit pattern, once detected, can reward the model with reduced computation to arrive at the answer.
  • Figure 3: Robustness test of implicit pattern detection. The horizontal axis shows accuracy under clean input, and the vertical axis shows accuracy under misleading input. Results closer to the bottom-right corner indicate weaker resistance to misleading data; results closer to the top-left corner indicate stronger resistance.
  • Figure 4: The progression of loss and accuracy over time during the fine-tuning of implicit pattern tasks. The Real Loss Values (dashed blue line) show the raw, noisy loss during training. To mitigate this noise, the Smoothed Loss Values (solid blue line) provide a clearer trend of the overall loss reduction. We also show the average test accuracy over all tasks (solid green line).
  • Figure 5: Illustration of circuit shift comparison. Circuits in the LLMs are first detected with activation patching. We then compare how much these circuits change after fine-tuning and after in-context learning.
  • ...and 1 more figures