ICLEval: Evaluating In-Context Learning Ability of Large Language Models
Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, Yantao Jia, Zhao Cao, Ji-Rong Wen
TL;DR
ICLEval is a focused benchmark for evaluating in-context learning (ICL) in large language models. It separates ICL into two sub-abilities, exact copying and rule learning, and tests both in unstructured and structured contexts. The benchmark comprises 12 tasks with 2,040 samples designed to decouple ICL from language ability and knowledge, and it analyzes the effects of model size and pretraining progression: copying arises early in pretraining, while rule learning improves more gradually with model size. Key findings are that model size is not the sole determinant of ICL performance, that tokenization and inherent model biases can significantly influence results, and that ICL abilities plateau after initial pretraining despite continued gains in knowledge. The benchmark and codebase are publicly available to support systematic evaluation of ICL across diverse open-source LLMs, and the work highlights practical considerations for training models with robust ICL capabilities.
Abstract
In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs), as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can improve how they are used and deepen our understanding of how this ability is acquired during training. However, existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. In this work, we introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompass two key sub-abilities: exact copying and rule learning. Through the ICLEval benchmark, we demonstrate that ICL ability is universally present in different LLMs, and that model size is not the sole determinant of ICL efficacy. Surprisingly, we observe that ICL abilities, particularly copying, develop early in the pretraining process and stabilize afterward. Our source code and benchmark are released at https://github.com/yiye3/ICLEval.
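
To make the exact-copying sub-ability concrete, the following is a minimal illustrative sketch of how such a probe could be built and scored. It is not the ICLEval data format or scoring code: the function names make_copy_sample and exact_match, the "key: value" prompt layout, and the use of random strings are all assumptions chosen for illustration. The idea it tries to capture is that the answer is a random string, so it can only be obtained by copying from the context rather than from language ability or stored knowledge.

```python
import random
import string


def make_copy_sample(n_pairs=5, key_len=8, seed=0):
    """Build one hypothetical exact-copying sample (not the ICLEval format).

    The context lists random key/value strings, so the answer cannot come
    from world knowledge; it has to be copied from the context.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits

    def rand_str():
        return "".join(rng.choice(alphabet) for _ in range(key_len))

    # A dict keeps keys unique; loop until we have the requested number.
    pairs = {}
    while len(pairs) < n_pairs:
        pairs[rand_str()] = rand_str()

    # Pick one key to query and leave its value for the model to complete.
    query_key = rng.choice(list(pairs))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n{query_key}:"
    return prompt, pairs[query_key]


def exact_match(prediction, target):
    """Return 1.0 only if the whitespace-stripped prediction equals the target."""
    return float(prediction.strip() == target)


if __name__ == "__main__":
    prompt, target = make_copy_sample(seed=0)
    print(prompt)
    # A real evaluation would send `prompt` to a model; here we score the
    # reference answer itself just to show how the metric is applied.
    print(exact_match(" " + target, target))
```

A rule-learning probe would differ in that the target is not present verbatim in the context and must be derived from a pattern shown in the examples; that case is not covered by this sketch.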
