Table of Contents
Fetching ...

Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls

Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Changran Hu, Qizheng Zhang, Urmish Thakker

TL;DR

It is found that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks.

Abstract

Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.

Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls

TL;DR

It is found that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks.

Abstract

Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.
Paper Structure (22 sections, 4 figures, 5 tables)

This paper contains 22 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: A unified view of prompt-based test-time adaptation. Update design determines whether added context provides signal or noise.
  • Figure 2: Scaling behavior of prompt-based test-time updates. (a) Many-shot accuracy vs. update magnitude (b) Dynamic ICL update policies (c) Model capacity effects (d) Reinforced ICL scaling
  • Figure 3: Dynamic ICL Selection Strategies on Banking 77
  • Figure 4: Context length scaling with update magnitude