ICLEval: Evaluating In-Context Learning Ability of Large Language Models

Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, Yantao Jia, Zhao Cao, Ji-Rong Wen

TL;DR

ICLEval is a focused benchmark for evaluating in-context learning (ICL) in large language models. It splits ICL into two sub-abilities, exact copying and rule learning, and tests both across unstructured and structured contexts. The benchmark comprises 12 tasks with 2,040 samples designed to decouple ICL from language ability and knowledge, and it analyzes the effects of model size and pretraining progression. Copying emerges early in pretraining, while rule learning improves more gradually with model size. Key findings are that model size is not the sole determinant of ICL performance, that tokenization and inherent model biases can significantly influence results, and that ICL abilities plateau after the initial phase of pretraining even as knowledge continues to grow. The benchmark and codebase are publicly available to support systematic evaluation of ICL across open-source LLMs, and the work highlights practical considerations for training models with robust ICL capabilities.

Abstract

In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can enhance their utilization and deepen our understanding of how this ability is acquired at the training stage. However, existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. In this work, we introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning. Through the ICLEval benchmark, we demonstrate that ICL ability is universally present in different LLMs, and that model size is not the sole determinant of ICL efficacy. Surprisingly, we observe that ICL abilities, particularly copying, develop early in the pretraining process and stabilize afterward. Our source code and benchmark are released at https://github.com/yiye3/ICLEval.
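
To make the exact-copying sub-ability concrete, the following is a minimal sketch of how a copying sample could be built and scored. It is not taken from the ICLEval codebase: the prompt layout, the function names (make_copy_sample, exact_match, oracle_model), and the use of random alphanumeric strings are all assumptions chosen for illustration. The idea is that random keys and values carry no linguistic or factual content, so a correct answer can only come from the context.

```python
import random
import string


def make_copy_sample(num_pairs=8, key_len=6, value_len=10, seed=0):
    """Build one synthetic key-value lookup prompt (hypothetical format).

    Keys and values are random strings, so the answer cannot be recalled
    from pretraining knowledge; it has to be copied from the context.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits

    def rand_str(n):
        return "".join(rng.choice(alphabet) for _ in range(n))

    # Use a dict so that keys are guaranteed to be unique.
    pairs = {}
    while len(pairs) < num_pairs:
        pairs[rand_str(key_len)] = rand_str(value_len)

    query_key = rng.choice(list(pairs))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n{query_key}:"
    return prompt, pairs[query_key]


def exact_match(prediction, target):
    """Return 1.0 only if the first line of the prediction equals the target."""
    stripped = prediction.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return 1.0 if first_line.strip() == target else 0.0


def oracle_model(prompt):
    """Stand-in for an LLM call: looks the query key up in the prompt itself."""
    *context_lines, query_line = prompt.splitlines()
    query_key = query_line.rstrip(":")
    for line in context_lines:
        key, value = line.split(": ", 1)
        if key == query_key:
            return " " + value
    return ""


if __name__ == "__main__":
    prompt, target = make_copy_sample(seed=1)
    print(prompt)
    print("score:", exact_match(oracle_model(prompt), target))
```

In practice oracle_model would be replaced by a call to the model under test. Exact-match scoring is used here because partial credit would blur the distinction between copying a string and merely producing something similar to it.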

Paper Structure

This paper contains 29 sections, 20 figures, 9 tables.

Figures (20)

  • Figure 1: Examples of six representative tasks in ICLEval.
  • Figure 2: Scores during the pretraining of TinyLlama-1.1B on 3T tokens.
  • Figure 3: Scores during the pretraining of Baichuan2-7B on 2.6T tokens.
  • Figure 4: ICL ability and knowledge of Baichuan2-7B. ICL ability is acquired in the early stage of pretraining, while knowledge is acquired throughout the whole pretraining stage.
  • Figure 5: Performance changes as more similar strings appear in the in-context examples.
  • ...and 15 more figures