Focused Large Language Models are Stable Many-Shot Learners

Peiwen Yuan; Shaoxiong Feng; Yiwei Li; Xinglin Wang; Yueqi Zhang; Chuyi Tan; Boyuan Pan; Heda Wang; Yao Hu; Kan Li


TL;DR

This work proposes FocusICL, a training-free method with three parts. At the token level, triviality filtering keeps attention from being diverted by unimportant content. At the demonstration level, hierarchical attention ensures that the current query receives sufficient attention. Finally, an efficient hyperparameter search strategy selects settings based on the model's perplexity on the demonstrations.

Abstract

In-Context Learning (ICL) enables large language models (LLMs) to adapt rapidly to new tasks by learning from demonstrations. As the available context length of LLMs has grown, recent experiments have shown that ICL performance does not necessarily scale well in many-shot (demonstration) settings. We confirm theoretically and experimentally that this is because additional demonstrations disperse the model's attention away from the query, hindering its understanding of key content. Inspired by how humans learn from examples, we propose FocusICL, a training-free method that applies triviality filtering at the token level, so that attention is not diverted by unimportant content, and hierarchical attention at the demonstration level, so that the current query receives sufficient attention. We also design an efficient hyperparameter search strategy for FocusICL based on the model's perplexity on the demonstrations. Comprehensive experiments show that FocusICL achieves an average performance improvement of 5.2% over vanilla ICL and scales well with many-shot demonstrations.
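The attention-dispersion argument in the abstract can be illustrated with a toy softmax calculation. The sketch below is not the FocusICL implementation: the logit values, the token counts, and the helper names query_attention_share and filter_trivial are all hypothetical, chosen only to show that the query's share of a single softmax shrinks as demonstration tokens are added, and that dropping low-scoring demonstration tokens gives some of that share back. It does not model hierarchical attention or the perplexity-based search.

```python
import math


def query_attention_share(query_logits, demo_logits):
    """Fraction of softmax attention mass that lands on the query tokens."""
    all_logits = query_logits + demo_logits
    peak = max(all_logits)  # subtract the max for numerical stability
    q_mass = sum(math.exp(x - peak) for x in query_logits)
    d_mass = sum(math.exp(x - peak) for x in demo_logits)
    return q_mass / (q_mass + d_mass)


def filter_trivial(demo_logits, drop_fraction):
    """Toy stand-in for triviality filtering (an assumption, not the paper's rule):
    drop the lowest-scoring fraction of demonstration tokens."""
    n_keep = len(demo_logits) - int(len(demo_logits) * drop_fraction)
    return sorted(demo_logits, reverse=True)[:n_keep]


# Hypothetical setup: 20 query tokens with logit 1.0; each demonstration
# contributes 25 "useful" tokens (logit 0.0) and 25 "trivial" ones (logit -1.0).
query = [1.0] * 20
one_demo = [0.0] * 25 + [-1.0] * 25

shares = {}
for n_demos in (1, 10, 100):
    shares[n_demos] = query_attention_share(query, one_demo * n_demos)
    print(f"{n_demos:>3} demos: query share = {shares[n_demos]:.4f}")

# Dropping the lowest-scoring half of the demonstration tokens raises the
# query's share again in the 100-demonstration case.
filtered = filter_trivial(one_demo * 100, drop_fraction=0.5)
filtered_share = query_attention_share(query, filtered)
print(f"100 demos, filtered: query share = {filtered_share:.4f}")
```

In this toy setting the query's share falls monotonically with the number of demonstrations, which mirrors the qualitative claim of the abstract; the actual magnitudes in a real model depend on learned attention scores rather than the fixed logits assumed here.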


Paper Structure (42 sections, 14 equations, 8 figures, 11 tables, 1 algorithm)


Introduction
Background
Formalization of ICL
Scaling Demonstration Number
Revisiting
Approximating ICL as Finetuning
Finetuning
ICL
Ignorance of Attention Competition
Experimental Evidence for Hypothesis
Methodology
Triviality Filtering
Hierarchical Attention
Hyperparameter Searching
Experiments
...and 27 more sections

Figures (8)

Figure 1: The average model attention for query is dispersed by the increased number of demonstrations, causing inadequate understanding of query.
Figure 2: Accuracy and attention of longchat-7b-v1.5-32k with varying number of spaces added per demonstration. Demonstration number is set as 100.
Figure 3: Overall illustration of FocusICL.
Figure 4: Input details of FocusICL.
Figure 5: FocusICL helps different LLMs scale well with many-shot demonstrations compared with ICL.
...and 3 more figures
