More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives

Xiaoqing Zhang; Ang Lv; Yuhan Liu; Flood Sung; Wei Liu; Jian Luan; Shuo Shang; Xiuying Chen; Rui Yan

TL;DR

DrICL tackles the performance decline of many-shot ICL by addressing two causes: a suboptimal NLL optimization objective and accumulating data noise. It introduces a global differentiated learning objective $L_{\text{diff}} = (1 + \alpha) L_{\text{many-shot}} + (1 - \alpha) L_{\text{zero-shot}}$ and a local advantage-based reweighting scheme that uses a cumulative advantage $\mathcal{A}_k = \exp(\mathcal{R}_k / \gamma)$ with reward $\mathcal{R}_k = L_{\text{many-shot},k} - L_{\text{sampling}_{w-1}}$ to downweight noisy demonstrations. The authors also provide the ICL-50 benchmark, a large-scale 50-task dataset with shot counts from 1 to 350, to study many-shot ICL across diverse tasks. Experiments on Llama-2-7b-chat-hf and Mistral-7B-Instruct-v0.2 show significant, stable improvements in both in-domain and out-of-domain settings, and the authors release the code and ICL-50 to spur further research.
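As a rough illustration of the global objective quoted in the TL;DR, the sketch below combines a many-shot and a zero-shot NLL value using the stated formula. The function name `differentiated_loss` and the scalar-float interface are assumptions made for this example; the summary does not say how the two losses are computed or batched in the released code.

```python
def differentiated_loss(many_shot_nll: float, zero_shot_nll: float, alpha: float) -> float:
    """Combine two NLL values as (1 + alpha) * L_many-shot + (1 - alpha) * L_zero-shot.

    Hypothetical helper: only the formula comes from the TL;DR; the name,
    signature, and use of plain floats are assumptions for illustration.
    """
    return (1.0 + alpha) * many_shot_nll + (1.0 - alpha) * zero_shot_nll


# alpha = 0 weights both losses equally; alpha > 0 puts more weight on the
# many-shot loss and less on the zero-shot loss.
equal_weighting = differentiated_loss(2.0, 3.0, alpha=0.0)   # 2.0 + 3.0 = 5.0
shifted_weighting = differentiated_loss(2.0, 3.0, alpha=0.5)  # 1.5 * 2.0 + 0.5 * 3.0 = 4.5
```

In a training loop the two inputs would presumably be the mean token-level NLL of the many-shot and zero-shot versions of the same example, but that pairing is an assumption here, not something this summary specifies.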

Abstract

Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as the number of ICL demonstrations increases from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DrICL, a novel optimization method that enhances model performance through Differentiated and Reweighting objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (ICL-50), a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset, hoping to facilitate further research in many-shot ICL (https://github.com/xiaoqzhwhu/DrICL).
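The local reweighting step can be sketched from the two formulas given in the TL;DR: a reward $\mathcal{R}_k$ equal to the many-shot loss at demonstration $k$ minus a loss sampled from the preceding window, and a cumulative advantage $\mathcal{A}_k = \exp(\mathcal{R}_k / \gamma)$. The code below is a minimal sketch that evaluates those formulas only. The function names are hypothetical, the previous-window sampling loss is taken as a given input, and the sketch does not show how advantages are turned into final demonstration weights, since this page does not describe that step.

```python
import math
from typing import List


def cumulative_advantage(many_shot_nll_k: float, prev_window_sampling_nll: float, gamma: float) -> float:
    """Evaluate A_k = exp(R_k / gamma) with R_k = L_many-shot,k - L_sampling_{w-1}.

    Hypothetical helper: the formulas come from the TL;DR; the name and
    signature are assumptions for illustration.
    """
    if gamma <= 0:
        raise ValueError("gamma is assumed to be a positive temperature")
    reward = many_shot_nll_k - prev_window_sampling_nll
    return math.exp(reward / gamma)


def window_advantages(window_nlls: List[float], prev_window_sampling_nll: float, gamma: float) -> List[float]:
    """Apply cumulative_advantage to every demonstration loss in the current window."""
    return [cumulative_advantage(nll, prev_window_sampling_nll, gamma) for nll in window_nlls]


# Window of size 3 (matching |W| = 3 in the Figure 2 caption) compared against
# one loss value sampled from the preceding window (|S| = 1).
advantages = window_advantages([1.0, 2.0, 0.5], prev_window_sampling_nll=1.0, gamma=1.0)
```

A demonstration whose loss equals the previous-window baseline gets an advantage of exactly 1; the temperature `gamma` controls how sharply deviations from that baseline are amplified.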

Paper Structure (33 sections, 9 equations, 8 figures, 11 tables, 1 algorithm)

Introduction
Related Work
DrICL
Global Perspective: Differentiated Learning
Local Perspective: Advantage-based Reweighting
Importance Sampling
Advantage Functions
Reweighting
Learning Strategy
Experiments
Experimental Setup
Datasets
Base Models
Evaluation Metrics
Implementation Details
...and 18 more sections

Figures (8)

Figure 1: The performance trend of LLMs across different $k$-shot scenarios, where $k$ is the number of demonstration examples provided to the LLM. "+MetaICL" denotes fine-tuning with MetaICL, while "+DrICL" denotes fine-tuning with our DrICL strategy.
Figure 2: The DrICL Training Framework. (a) The global differentiated learning for many-shot and zero-shot demonstrations. (b) The local advantage-based reweighting method assigns differential weights to demonstrations in window $w$ with window size $|W|=3$ and sampling size $|S|=1$, utilizing the cumulative advantage from the preceding window $w-1$.
Figure 3: Performance with incremental $k$-shots for Mistral-7B-Instruct-v0.2 and Llama-2-7b-chat-hf on CLSClusteringS2S under different strategies. We focus on CLSClusteringS2S for its high $k$-shot count, which enables a broader evaluation of DrICL. DrICL consistently performs better across a diverse range of $k$.
Figure 4: The token distributions of each task dataset.
Figure 5: The $k$-shot distributions of each task dataset.
...and 3 more figures
