Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls

Shubhangi Upasani; Chen Wu; Jay Rainton; Bo Li; Changran Hu; Qizheng Zhang; Urmish Thakker

TL;DR

Many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but it is highly sensitive to the selection strategy and often yields limited benefits on open-ended generation tasks.

Abstract

Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.
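To make the idea of an input-space update concrete, the following is a minimal sketch of how a many-shot prompt might be assembled under two selection policies. It is not the authors' implementation: the function names (`select_examples`, `build_prompt`), the `Input:`/`Label:` prompt template, and the token-overlap similarity are all assumptions chosen for illustration, and a real Dynamic ICL setup would more plausibly use an embedding-based retriever. Here `k` plays the role of update magnitude and `policy` the role of selection policy.

```python
import random
from typing import List, Tuple

# A demonstration is an (input text, label) pair. Hypothetical type for this sketch.
Example = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a simple stand-in for an embedding retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_examples(pool: List[Example], query: str, k: int,
                    policy: str = "random", seed: int = 0) -> List[Example]:
    """Pick k demonstrations from the pool (k is the update magnitude)."""
    k = min(k, len(pool))
    if policy == "random":
        # Fixed, query-independent sample.
        return random.Random(seed).sample(pool, k)
    if policy == "similarity":
        # Query-dependent selection: keep the k most similar demonstrations.
        ranked = sorted(pool, key=lambda ex: jaccard(ex[0], query), reverse=True)
        return ranked[:k]
    raise ValueError(f"unknown policy: {policy}")


def build_prompt(demos: List[Example], query: str) -> str:
    """Concatenate demonstrations, then append the unanswered query."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    pool = [
        ("my card was declined", "card_declined"),
        ("how do I reset my pin", "change_pin"),
        ("card payment declined abroad", "card_declined"),
    ]
    query = "why was my card declined"
    demos = select_examples(pool, query, k=2, policy="similarity")
    print(build_prompt(demos, query))
```

Increasing `k` lengthens the prompt roughly linearly, which is why context length becomes a practical constraint as the number of demonstrations grows.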

Paper Structure (22 sections, 4 figures, 5 tables)

Introduction
Background
Many-Shot Prompting (Update Magnitude)
Dynamic ICL as Update Policy
Reinforced ICL as Update Structure
Experimental Setup
Results: Benefits And Limits of Test-Time Updates
More Context Helps—Until It Doesn’t
Update policy matters—relevance helps early, diversity helps at scale
Larger models benefit earlier, smaller models catch up
Reinforced ICL exhibits early gains and rapid saturation
Task structure determines the effectiveness of test-time updates
Conclusion
Appendix
Additional Background on Prompt-Based Test-Time Adaptation
...and 7 more sections

Figures (4)

Figure 1: A unified view of prompt-based test-time adaptation. Update design determines whether added context provides signal or noise.
Figure 2: Scaling behavior of prompt-based test-time updates. (a) Many-shot accuracy vs. update magnitude (b) Dynamic ICL update policies (c) Model capacity effects (d) Reinforced ICL scaling
Figure 3: Dynamic ICL Selection Strategies on Banking 77
Figure 4: Context length scaling with update magnitude
