Understanding the Dynamics of Demonstration Conflict in In-Context Learning

Difan Jiao, Di Wang, Lijie Hu

TL;DR

This work finds that a single demonstration with a corrupted rule substantially degrades model performance, and it identifies the attention heads behind the failure in each of two phases: Vulnerability Heads in early-to-middle layers exhibit positional attention bias and are highly sensitive to corruption, while Susceptible Heads in late layers significantly reduce support for the correct prediction when exposed to the corrupted evidence.

Abstract

In-context learning enables large language models to perform novel tasks through few-shot demonstrations. However, demonstrations can naturally contain noise and conflicting examples, which makes this capability vulnerable. To understand how models process such conflicts, we study demonstration-dependent tasks that require models to infer underlying patterns, a process we characterize as rule inference. We find that models suffer substantial performance degradation from a single demonstration with a corrupted rule. This systematic tendency to be misled motivates our investigation of how models process conflicting evidence internally. Using linear probes and logit lens analysis, we discover that under corruption models encode both the correct and the incorrect rule in intermediate layers but develop prediction confidence only in late layers, revealing a two-phase computational structure. We then identify the attention heads underlying the reasoning failures in each phase: Vulnerability Heads in early-to-middle layers exhibit positional attention bias and are highly sensitive to corruption, while Susceptible Heads in late layers significantly reduce support for correct predictions when exposed to the corrupted evidence. Targeted ablation validates our findings: masking a small number of the identified heads improves performance by over 10%.
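
To make the corruption setup concrete, the following is a minimal sketch of a few-shot prompt for an operator-induction task in which exactly one demonstration follows the wrong rule. The prompt format (`a ? b = c`), the operand range, and the function name `make_prompt` are assumptions chosen for illustration; they are not taken from the paper.

```python
import random


def make_prompt(n_demos, corrupt_pos, query, seed=0):
    """Build a few-shot prompt whose hidden rule is addition.

    The demonstration at index `corrupt_pos` is computed with
    multiplication instead, giving a single conflicting example.
    The "a ? b = c" layout is a hypothetical format, not the paper's.
    """
    rng = random.Random(seed)
    lines = []
    for i in range(n_demos):
        # Operands start at 3 so that a + b and a * b never coincide.
        a, b = rng.randint(3, 9), rng.randint(3, 9)
        result = a * b if i == corrupt_pos else a + b
        lines.append(f"{a} ? {b} = {result}")
    qa, qb = query
    lines.append(f"{qa} ? {qb} =")  # the model must complete this line
    return "\n".join(lines)


if __name__ == "__main__":
    # Five demonstrations, the third one (index 2) corrupted.
    print(make_prompt(n_demos=5, corrupt_pos=2, query=(4, 7)))
```

Sweeping `corrupt_pos` over every demonstration index, while holding the other arguments fixed, gives the kind of per-position comparison that a single-position corruption study needs.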

Paper Structure (37 sections, 7 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Left: The Operator Induction task, in which models must infer the underlying rule (+) from demonstrations, one of which we corrupt. Middle: Vulnerability Heads allocate attention disproportionately, and their outputs change drastically when the heavily attended position is corrupted. Right: Susceptible Heads show logit contributions favoring the corrupted operator ($\times$) over the correct one (+), even though the corrupted demonstration is in the minority.
  • Figure 2: Performance degradation under single-position corruption across different large language models and tasks. Each point represents the decrease in accuracy when corrupting the demonstration at that specific position.
  • Figure 3: Linear probe confidence across model layers under different corruption scenarios. The Correct Probe detects the model's encoding of the ground-truth operator, while the Corrupted Probe detects its encoding of the corrupted operators.
  • Figure 4: Logit lens predictions across corruption scenarios. Rows show different layers, columns represent correct and corrupted rules, and values indicate prediction probability for each rule decoded from layer-wise residual streams.
  • Figure 5: Distribution of Vulnerability Heads across model layers. The purple line shows the average vulnerability score (A × S) among the top 5 heads per layer. Individual heads are shown as dots.
  • ...and 7 more figures
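
The logit lens readout described in the Figure 4 caption can be sketched as follows: each layer's residual-stream vector at the final position is projected through the unembedding matrix, and the resulting probabilities for the correct and corrupted rule tokens are compared layer by layer. This is a toy, pure-Python illustration under stated assumptions: the vectors, the 3-token vocabulary, and the names `logit_lens` and `token_ids` are hypothetical, and the final layer normalization that a real model would typically apply before unembedding is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(residuals, unembed, token_ids):
    """Decode per-layer residual vectors into token probabilities.

    residuals: one hidden vector (length d) per layer.
    unembed:   one row (length d) per vocabulary token.
    token_ids: maps a label (e.g. "correct") to a vocabulary index.
    Returns one {label: probability} dict per layer.
    """
    table = []
    for h in residuals:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        table.append({name: probs[i] for name, i in token_ids.items()})
    return table


if __name__ == "__main__":
    # Toy data: vocabulary of 3 tokens, hidden size 2, 3 layers.
    unembed = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    residuals = [[0.0, 0.0], [1.0, 2.0], [3.0, 0.0]]
    ids = {"correct": 0, "corrupted": 1}
    for layer, row in enumerate(logit_lens(residuals, unembed, ids)):
        print(layer, row)
```

With real hidden states, a table of this shape (layers as rows, rules as columns) is what a heatmap like the one the caption describes would plot.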