Deconstructing In-Context Learning: Understanding Prompts via Corruption

Namrata Shivagunde, Vladislav Lialin, Sherin Muckatira, Anna Rumshisky

TL;DR

This work deconstructs in-context learning by splitting prompts into four components—task instructions, demonstration inputs, demonstration labels, and inline instructions—and studies how structural and semantic corruptions across these components affect performance. Using ten models from 1.5B to 70B and ten datasets, the authors show that repeated or redundant text in prompts can boost results, inline instructions often matter more than task descriptions, and larger models exhibit stronger sensitivity to semantic quality. They also analyze attention allocation to reveal that bigger models focus more on relevant prompt parts (labels, inline instructions, demonstrations) than separators. The findings offer practical guidelines for prompting strategies and illuminate differences in how model size and prompt design interact during zero-shot and few-shot tasks. Overall, the study advances understanding of what drives ICL robustness and how to optimize prompts for backbone LLMs in real-world deployments.

Abstract

The ability of large language models (LLMs) to "learn in context" based on the provided prompt has led to an explosive growth in their use, culminating in the proliferation of AI assistants such as ChatGPT, Claude, and Bard. These AI assistants are known to be robust to minor prompt modifications, mostly due to alignment techniques that use human feedback. In contrast, the underlying pre-trained LLMs they use as a backbone are known to be brittle in this respect. Building high-quality backbone models remains a core challenge, and a common approach to assessing their quality is to conduct few-shot evaluation. Such evaluation is notorious for being highly sensitive to minor prompt modifications, as well as the choice of specific in-context examples. Prior work has examined how modifying different elements of the prompt can affect model performance. However, these earlier studies tended to concentrate on a limited number of specific prompt attributes and often produced contradictory results. Additionally, previous research either focused on models with fewer than 15 billion parameters or exclusively examined black-box models like GPT-3 or PaLM, making replication challenging. In the present study, we decompose the entire prompt into four components: task description, demonstration inputs, labels, and inline instructions provided for each demonstration. We investigate the effects of structural and semantic corruptions of these elements on model performance. We study models ranging from 1.5B to 70B in size, using ten datasets covering classification and generation tasks. We find that repeating text within the prompt boosts model performance, and bigger models ($\geq$30B) are more sensitive to the semantics of the prompt. Finally, we observe that adding task and inline instructions to the demonstrations enhances model performance even when the instructions are semantically corrupted.
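
The four-component decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The two-newline separator follows the Figure 1 caption; the layout inside each demonstration (input, newline, inline instruction, space, label), the helper names `build_prompt` and `random_words`, the example texts, and the token-by-token random-word replacement are all assumptions made for illustration.

```python
import random

# Two newlines after the task instruction and after each demonstration,
# as stated in the Figure 1 caption.
SEPARATOR = "\n\n"


def random_words(text, vocab, rng):
    """Hypothetical semantic corruption: swap every whitespace-separated
    token for a word drawn at random from vocab (same token count)."""
    return " ".join(rng.choice(vocab) for _ in text.split())


def build_prompt(task_instruction, demos, inline_instruction, test_input):
    """Assemble a prompt from the four components.

    demos is a list of (demonstration_input, label) pairs. The layout
    inside a demonstration is an assumption, not taken from the paper.
    """
    parts = [task_instruction]
    for demo_input, label in demos:
        parts.append(f"{demo_input}\n{inline_instruction} {label}")
    # The test instance ends with the inline instruction; the model
    # is expected to produce the label.
    parts.append(f"{test_input}\n{inline_instruction}")
    return SEPARATOR.join(parts)


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["lamp", "river", "seven", "cloud"]  # illustrative vocabulary
    task = "Classify the emotion expressed in the tweet."  # made-up text
    inline = "The emotion is:"  # made-up text
    demos = [("I love this!", "joy"), ("So sad today.", "sadness")]

    baseline = build_prompt(task, demos, inline, "What a day.")
    corrupted = build_prompt(
        random_words(task, vocab, rng),
        demos,
        random_words(inline, vocab, rng),
        "What a day.",
    )
    print(baseline)
    print("-----")
    print(corrupted)
```

Keeping each component as a separate argument means a single corruption, such as replacing only the instructions with random words, can be applied while the demonstration inputs and labels stay untouched.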

Paper Structure

This paper contains 29 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Prompt components of the Twitter Emotion Classification baseline prompt. Each demonstration includes an input, an inline instruction, and a label. Two newlines are added as separators after the task instruction and each demonstration. Prompts are taken verbatim from Super-NaturalInstructions and PromptSource.
  • Figure 2: Demonstrations improve the average score, and adding task and inline instructions improves it further, even when the instructions are just random words. The Y-axis represents the average score across all datasets. The use of random words is indicated with "rw".
  • Figure 3: Adding relevant or meaningless instructions to the prompt improves model performance. The components are added to the test instance; for example, '+ demonstrations' means test instance + demonstrations. The Y-axis represents the average score across all datasets. Random words are indicated with "rw".
  • Figure 4: Repeated text boosts performance. The baseline prompt has the inline instruction in four demos. The inline instruction that occurs after the test instance is kept as is.
  • Figure 5: Repeated text boosts performance even when the text is irrelevant; "rw" refers to random words. The prompts include all components, but the instructions are replaced with random words.
  • ...and 10 more figures