NICE: To Optimize In-Context Examples or Not?
Pragya Srivastava, Satvik Golechha, Amit Deshpande, Amit Sharma
TL;DR
The paper questions whether optimizing in-context examples (ICE) is always worthwhile when task instructions are detailed, and introduces the Normalized Invariability to Choice of Examples (NICE) metric to predict when ICE optimization or instruction optimization is the better investment. Through experiments across diverse tasks and models, it shows that detailed instructions often make model performance invariant to the choice of ICE (high NICE), so that random ICE match or outperform carefully selected demonstrations. Conversely, for tasks with complex output schemas (low NICE), ICE optimization remains beneficial. NICE makes this trade-off measurable, which helps with efficient prompt engineering. The work provides practical guidance and a scalable metric for allocating compute between instruction design and ICE retrieval, with code available at the authors’ repository.
Abstract
Recent work shows that in-context learning and optimization of in-context examples (ICE) can significantly improve the accuracy of large language models (LLMs) on a wide range of tasks, leading to an apparent consensus that ICE optimization is crucial for better performance. However, most of these studies assume either a fixed instruction or no instruction in the prompt. We challenge this consensus by investigating the necessity of optimizing ICE when task-specific instructions are provided and find that there are many tasks for which it yields diminishing returns. In particular, using a diverse set of tasks and a systematically created instruction set with gradually added details, we find that as the prompt instruction becomes more detailed, the returns on ICE optimization diminish. To characterize this behavior, we introduce a task-specific metric called Normalized Invariability to Choice of Examples (NICE) that quantifies the learnability of tasks from a given instruction and provides a heuristic to help decide whether to optimize instructions or ICE for a new task. Given a task, the proposed metric can reliably predict the utility of optimizing ICE compared to using random ICE. Our code is available at https://github.com/microsoft/nice-icl.
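The abstract does not state the NICE formula, so the following is only an illustrative sketch of what a normalized invariability score could look like, not the paper's definition. It assumes that, for a fixed instruction, task accuracy has been measured separately for several groups of candidate in-context examples (for instance, groups of differing similarity to the query), and it normalizes the average group accuracy by the best group accuracy. The function name `invariability_score` and the grouping scheme are hypothetical.

```python
from typing import Sequence


def invariability_score(group_accuracies: Sequence[float]) -> float:
    """Illustrative invariability score (an assumption, not the paper's formula).

    group_accuracies holds one task accuracy per group of in-context
    examples, all measured under the same instruction. The result is the
    mean accuracy divided by the best accuracy: values near 1 mean the
    choice of examples barely matters, lower values mean that picking
    the right examples pays off.
    """
    if not group_accuracies:
        raise ValueError("need at least one group accuracy")
    best = max(group_accuracies)
    if best <= 0:
        raise ValueError("best group accuracy must be positive")
    average = sum(group_accuracies) / len(group_accuracies)
    return average / best


# Hypothetical accuracies for two tasks, three example groups each.
detailed_instruction_task = [0.80, 0.80, 0.80]  # examples barely matter
schema_heavy_task = [0.90, 0.30, 0.30]          # one group clearly wins

print(invariability_score(detailed_instruction_task))
print(invariability_score(schema_heavy_task))
```

Under this sketch, a score close to 1 would suggest spending effort on the instruction, while a noticeably lower score would suggest that example selection is still worth optimizing. Any concrete cutoff between the two regimes would be a further assumption.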
