NICE: To Optimize In-Context Examples or Not?

Pragya Srivastava; Satvik Golechha; Amit Deshpande; Amit Sharma

TL;DR

The paper questions whether optimizing in-context examples (ICE) is always worthwhile when task instructions are detailed, and introduces the Normalized Invariability to Choice of Examples (NICE) metric to predict when optimizing ICE, rather than the instruction, is advantageous. Through experiments across diverse tasks and models, it shows that detailed instructions often make model performance invariant to the choice of ICE (high NICE), so that random ICE match or outperform carefully selected demonstrations. Conversely, for tasks with complex output schemas such as MTOP and Break (low NICE), ICE optimization remains beneficial. NICE thus offers a practical heuristic for deciding whether to spend compute on instruction design or on ICE retrieval. Code is available at https://github.com/microsoft/nice-icl.

Abstract

Recent work shows that in-context learning and optimization of in-context examples (ICE) can significantly improve the accuracy of large language models (LLMs) on a wide range of tasks, leading to an apparent consensus that ICE optimization is crucial for better performance. However, most of these studies assume a fixed or no instruction provided in the prompt. We challenge this consensus by investigating the necessity of optimizing ICE when task-specific instructions are provided and find that there are many tasks for which it yields diminishing returns. In particular, using a diverse set of tasks and a systematically created instruction set with gradually added details, we find that as the prompt instruction becomes more detailed, the returns on ICE optimization diminish. To characterize this behavior, we introduce a task-specific metric called Normalized Invariability to Choice of Examples (NICE) that quantifies the learnability of tasks from a given instruction, and provides a heuristic to help decide whether to optimize instructions or ICE for a new task. Given a task, the proposed metric can reliably predict the utility of optimizing ICE compared to using random ICE. Our code is available at https://github.com/microsoft/nice-icl.

Paper Structure (33 sections, 12 equations, 15 figures, 2 tables)

Introduction
Related Work
ICE Selection Methods
Jointly Optimizing Instructions and ICE
Problem Formulation
Prompt Structure and Instruction Set
Research Questions
NICE: Measuring Invariability to ICE
Metric Requirements
Normalized Invariability to Choice of Examples (NICE)
Results
Experimental Setup
RQ1: Does ICE optimization always help?
RQ2: Optimized ICE versus Random ICE
RQ3: Do ground-truth labels matter?
...and 18 more sections

Figures (15)

Figure 1: Prompt optimization using the NICE score. Given a task and a template instruction, the NICE score provides a heuristic to decide whether to improve the instruction (high NICE) or the in-context examples (low NICE) to optimize task performance.
Figure 2: GPT-4's performance trend on various tasks as the average distance of ICE from the query increases. Arrow indicates change in NICE from no instruction to a detailed instruction. With a detailed instruction, NICE is near 1 for all tasks except schema-based generative tasks like MTOP and Break. Error bars are calculated from the standard deviation of accuracies across 50 queries.
Figure 3: Comparing ICE optimization methods against random ICE with a detailed instruction for classification and structured generation tasks using GPT-3.5. Abbreviations: LM (Task Def. with Label Meanings), R (Task Def. with Rules).
Figure 4: Label perturbation (gold to random) performance for classification and structured generation tasks using GPT-3.5-turbo. Abbreviations: DI (Delusive Instruction), NI (No Instruction), TD (Task Definition), +LS (Task Def. with Label Space), +LM (Task Def. with Label Meanings), +R (Task Def. with Rules). Error bars are calculated from the standard deviation in accuracies in the same way as in Fig. \ref{fig:ice_v_inst} (random example case).
Figure 5: Classification Task - Delusive Instruction.
...and 10 more figures
