Inference and Verbalization Functions During In-Context Learning

Junyi Tao, Xiaoyin Chen, Nelson F. Liu

TL;DR

This work probes how in-context learning in large language models operates when demonstrations use irrelevant or misleading label words. It proposes a two-function causal decomposition: an inference function that derives an answer representation and a verbalization function that maps that representation to the demonstrated label space, with the inference function invariant to label remappings. Using interchange interventions, the authors demonstrate that these functions can be localized in separate, consistent layers across multiple models and tasks, supporting a mechanistic account of ICL. The findings show middle-layer interventions reliably induce counterfactual outputs, suggesting robust transfer of information between the two functions, and generalize across datasets like MultiNLI, RTE, ANLI, IMDb, and AGNews. Overall, the paper advances a causal, mechanistic understanding of ICL and offers a methodology for probing internal representations with potential implications for robust prompt design and model interpretability ($H_1$,$H_2$,$H_3$).

Abstract

Large language models (LMs) are capable of in-context learning from a few demonstrations (example-label pairs) to solve new tasks during inference. Despite the intuitive importance of high-quality demonstrations, previous work has observed that, in some settings, ICL performance is minimally affected by irrelevant labels (Min et al., 2022). We hypothesize that LMs perform ICL with irrelevant labels via two sequential processes: an inference function that solves the task, followed by a verbalization function that maps the inferred answer to the label space. Importantly, we hypothesize that the inference function is invariant to remappings of the label space (e.g., "true"/"false" to "cat"/"dog"), enabling LMs to share the same inference function across settings with different label words. We empirically validate this hypothesis with controlled layer-wise interchange intervention experiments. Our findings confirm the hypotheses on multiple datasets and tasks (natural language inference, sentiment analysis, and topic classification) and further suggest that the two functions can be localized in specific layers across various open-sourced models, including GEMMA-7B, MISTRAL-7B-V0.3, GEMMA-2-27B, and LLAMA-3.1-70B.
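The two-function hypothesis in the abstract can be written compactly. The notation below is our own shorthand and an assumption, not the paper's: $C_{\mathcal{L}}(x)$ is the ICL prompt for query $x$ with demonstrations labeled in label space $\mathcal{L}$, $f_{\mathrm{inf}}$ is the inference function, $f_{\mathrm{verb}}$ is the verbalization function, and $\pi$ is a remapping of label words such as "true"/"false" to "cat"/"dog".

```latex
% Sequential decomposition: infer an answer representation a, then verbalize it.
\begin{align*}
  a &= f_{\mathrm{inf}}\bigl(C_{\mathcal{L}}(x)\bigr), \\
  Y &= f_{\mathrm{verb}}\bigl(a,\; C_{\mathcal{L}}(x)\bigr).
\end{align*}

% Invariance of the inference function under a label remapping \pi:
\[
  f_{\mathrm{inf}}\bigl(C_{\mathcal{L}}(x)\bigr)
  = f_{\mathrm{inf}}\bigl(C_{\pi(\mathcal{L})}(x)\bigr).
\]
```

Under this reading, only $f_{\mathrm{verb}}$ depends on which label words the demonstrations use, which is what would let a model share one inference function across prompts with different label words.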

Paper Structure

This paper contains 62 sections, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Summary of our hypotheses, method, and findings. We hypothesize that LMs perform ICL via two sequential functions: (1) an inference function that uses an input ICL context and returns a representation of the answer and (2) a verbalization function that maps the aforementioned answer representation to the output label space specified by the demonstrations. Furthermore, the inference function is invariant to remappings of the output label space. We empirically test our hypotheses by conducting interchange intervention experiments (i.e., fixing one of the functions while modifying the other).
  • Figure 2: Intervention experiments with remapped label spaces. (1, top): First, we induce changes in the verbalization function by remapping the output label space in the demonstrations (e.g., "true"/"false" to "cat"/"dog"). (2): Then, we intervene on the verbalization function by replacing the representation of the intervened model (prompted with the remapped label space "cat"/"dog") at layer $l$ with the representation from the source model (prompted with the default label space "true"/"false") at the same layer. (3): We evaluate the intervention effects by measuring how often the intervened output $Y"$ matches the hypothetical counterfactual output $Y_c$. (4, bottom): Distribution of outputs predicted by Gemma-7b on MultiNLI when intervening with the remapped label space "true"/"false" $\rightarrow$ "cat"/"dog". For clarity, we visualize the subset of examples where the hypothetical counterfactual output is "cat". We observe that the rate at which the intervened output matches the counterfactual output peaks around layers 18-20, localizing the verbalization function and suggesting the inference function is invariant to label remapping.
  • Figure 3: NLI experiments with remapped label spaces. The y-axis is the rate at which the output label flips to the hypothetical counterfactual output label ("flip rate") and the x-axis is the intervened layer. Each curve represents the flip rates observed when the intervened model is prompted with the corresponding label space. The background bars represent the flip rates of the default $\gets$ default baseline, where the intervened model is also prompted with the default label space ("true"/"false"). Because their ICL performance is unsatisfactory, we do not conduct intervention experiments with the two 7-billion-parameter models on ANLI (see further discussion in Section \ref{realize-scenario}).
  • Figure 4: Experiment with remapped label spaces on IMDb and AGNews. We perform the same experiments with remapped label spaces on tasks other than NLI. Different sets of label spaces are used for IMDb and AGNews to accommodate the semantic variations of the label words.
  • Figure 5: Experiment with reconstructed tasks on MultiNLI. Curves represent the flip rates of different settings where the intervened model is prompted with alternative tasks. The source model always sees the default task. The background bars represent the flip rates of the default $\gets$ default baseline, where both the intervened model and the source model are prompted with the default task (NLI).
  • ...and 9 more figures
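
The layer-wise interchange intervention and the flip rate described in the captions of Figures 2 and 3 can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the authors' code and not a real LM: the three "layers", the boolean inputs, and all names (`run`, `flip_rate`, `TOY_LAYERS`) are hypothetical. It also assumes that the source and intervened runs see different queries and that the counterfactual output is the source run's answer expressed in the intervened run's label space, which is our reading of the Figure 2 caption.

```python
from typing import Callable, List, Optional, Tuple

Hidden = Tuple[Optional[int], Optional[str]]  # (answer index, output word)
Context = Tuple[bool, Tuple[str, str]]        # (toy query, label space)
Layer = Callable[[Hidden, Context], Hidden]


def noop_layer(h: Hidden, ctx: Context) -> Hidden:
    # Early layer: nothing has been inferred yet.
    return h


def inference_layer(h: Hidden, ctx: Context) -> Hidden:
    # Toy inference function: reads the query, ignores the label words.
    query_is_true, _ = ctx
    return (0 if query_is_true else 1, h[1])


def verbalization_layer(h: Hidden, ctx: Context) -> Hidden:
    # Toy verbalization function: maps the answer index to this run's labels.
    _, labels = ctx
    return (h[0], labels[h[0]])


TOY_LAYERS: List[Layer] = [noop_layer, inference_layer, verbalization_layer]


def run(ctx: Context, patch_layer: Optional[int] = None,
        patch_hidden: Optional[Hidden] = None) -> Tuple[Optional[str], List[Hidden]]:
    """Run the toy stack, optionally overwriting the state after one layer."""
    h: Hidden = (None, None)
    cache: List[Hidden] = []
    for i, layer in enumerate(TOY_LAYERS):
        h = layer(h, ctx)
        if i == patch_layer:
            h = patch_hidden  # interchange: swap in the source run's state
        cache.append(h)
    return h[1], cache


def flip_rate(pairs, layer, source_labels, intervened_labels) -> float:
    """Fraction of pairs whose intervened output equals the counterfactual."""
    hits = 0
    for source_query, intervened_query in pairs:
        _, source_cache = run((source_query, source_labels))
        output, _ = run((intervened_query, intervened_labels),
                        patch_layer=layer, patch_hidden=source_cache[layer])
        source_answer = source_cache[-1][0]
        counterfactual = intervened_labels[source_answer]
        hits += output == counterfactual
    return hits / len(pairs)


# Made-up toy pairs in which source and intervened queries have opposite answers.
pairs = [(True, False), (False, True)]
rates = [flip_rate(pairs, l, ("true", "false"), ("cat", "dog"))
         for l in range(len(TOY_LAYERS))]
print(rates)
```

In this toy, patching before inference leaves the intervened run's own answer in place, patching after verbalization carries over the source run's label word, and only patching between the two yields the counterfactual output. That is the qualitative pattern the captions describe as a flip-rate peak in the middle layers; the toy says nothing about where the peak falls in any real model.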