DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning

Zijian Zhou, Xiaoqiang Lin, Xinyi Xu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

TL;DR

This work introduces DETAIL, an influence-function-based attribution method for in-context learning (ICL) that treats a transformer as running an internal kernelized regression over the demonstrations. By modeling the impact of each demonstration on a query through closed-form, ridge-regularized kernel regression over an internal representation m(x), DETAIL enables fast, order-aware attribution and supports both self-influence and test-influence calculations. The approach uses random projections to improve efficiency and shows practical benefits in demonstration perturbation, noisy-demonstration detection, and demonstration curation, with applications to on-device white-box LLMs and insights that transfer to black-box models such as GPT-3.5. The results indicate that DETAIL can improve ICL performance and reliability while offering interpretable, transferable attribution, highlighting its potential to guide demonstration selection and prompt design in real-world settings.
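The TL;DR describes attribution as influence computed on a closed-form ridge regression over internal representations m(x). A minimal sketch of that idea, assuming squared loss, one-hot labels, and the classic influence-function inner product $\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z_i)$. The function name `ridge_influence` is hypothetical, the feature vectors are random stand-ins rather than real LLM representations, and DETAIL's exact formula (including its sign convention and order-aware details) may differ.

```python
import numpy as np


def ridge_influence(M, Y, m_test, y_test, lam=1.0):
    """Influence-style scores for ridge regression on demonstration features.

    M: (n, d) array, one (hypothetical) representation m(x_i) per demonstration.
    Y: (n, c) array of one-hot labels.
    m_test: (d,) query representation; y_test: (c,) query target.
    Returns (I_test, I_self), each of shape (n,).
    """
    d = M.shape[1]
    # Hessian of 0.5*||M B - Y||^2 + 0.5*lam*||B||^2, shared by every output column.
    H = M.T @ M + lam * np.eye(d)
    B = np.linalg.solve(H, M.T @ Y)      # closed-form ridge weights, (d, c)
    R = M @ B - Y                        # residuals on demonstrations, (n, c)
    r_test = m_test @ B - y_test         # residual on the query, (c,)
    Hinv_Mt = np.linalg.solve(H, M.T)    # column i is H^{-1} m_i, (d, n)
    # The per-sample gradient w.r.t. B is m_i r_i^T, so the inner product
    # <grad_test, H^{-1} grad_i> factorizes as (m_test^T H^{-1} m_i) * (r_test . r_i).
    # To first order, a positive value means upweighting demonstration i lowers the query loss.
    I_test = (m_test @ Hinv_Mt) * (R @ r_test)
    # Self-influence <grad_i, H^{-1} grad_i> is non-negative since H is positive definite.
    I_self = np.einsum("nd,dn->n", M, Hinv_Mt) * np.sum(R**2, axis=1)
    return I_test, I_self


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.normal(size=(8, 16))                   # random stand-ins for m(x_i)
    Y = np.eye(2)[rng.integers(0, 2, size=8)]      # one-hot labels for 2 classes
    I_test, I_self = ridge_influence(M, Y, M[0], Y[0], lam=1.0)
    print(np.round(I_test, 4))
    print(np.round(I_self, 4))
```

Because the Hessian is shared across all demonstrations, one linear solve yields scores for every demonstration at once; in this sketch, that is where the speed advantage over refitting once per left-out demonstration comes from.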

Abstract

In-context learning (ICL) allows transformer-based language models that are pre-trained on general text to quickly learn a specific task from a few "task demonstrations" without updating their parameters, significantly boosting their flexibility and generality. ICL differs from conventional machine learning in many ways, so interpreting this learning paradigm requires new approaches. Taking the viewpoint of recent works showing that transformers learn in context by formulating an internal optimizer, we propose an influence-function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL. We empirically verify that our approach attributes demonstrations effectively while remaining computationally efficient. Leveraging these results, we then show how DETAIL can help improve model performance in real-world scenarios through demonstration reordering and curation. Finally, we experimentally demonstrate the wide applicability of DETAIL by showing that attribution scores obtained on white-box models transfer to black-box models and improve their performance.

Paper Structure (48 sections, 1 theorem, 9 equations, 18 figures, 8 tables, 1 algorithm)

Key Result

Theorem C.1

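For any $0 < \epsilon < 1$ and any integer $n$, let $d'$ be a positive integer such that

$$d' \geq 4\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)^{-1} \ln n.$$

Then for any set $A$ of $n$ points in $\mathbb{R}^d$, there exists a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$ such that for all $x_i, x_j \in A$,

$$(1-\epsilon)\lVert x_i - x_j \rVert^2 \leq \lVert f(x_i) - f(x_j) \rVert^2 \leq (1+\epsilon)\lVert x_i - x_j \rVert^2.$$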
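This lemma is what justifies projecting the representations m(x) down to a smaller dimension $d'$ (Figure 1 treats the projection as optional; Figure 4 uses $d' = 1000$). A minimal sketch, assuming a dense Gaussian projection matrix with entries drawn from $N(0, 1/d')$. This is one standard Johnson-Lindenstrauss construction and not necessarily the one DETAIL uses; the helper names are hypothetical.

```python
import math

import numpy as np


def jl_min_dim(n, eps):
    """Smallest d' allowed by the bound d' >= 4 (eps^2/2 - eps^3/3)^{-1} ln n."""
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))


def gaussian_projection(X, d_prime, seed=0):
    """Map rows of X (n, d) to d_prime dims with a random Gaussian matrix.

    Entries are N(0, 1/d_prime), so squared distances are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=1.0 / math.sqrt(d_prime), size=(X.shape[1], d_prime))
    return X @ P


def max_sq_distortion(X, Z):
    """Largest |(||z_i - z_j||^2 / ||x_i - x_j||^2) - 1| over all pairs i < j."""
    worst = 0.0
    n = X.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = np.sum((Z[i] - Z[j]) ** 2) / np.sum((X[i] - X[j]) ** 2)
            worst = max(worst, abs(ratio - 1.0))
    return worst


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(20, 2048))  # stand-in for m(x) vectors
    print(jl_min_dim(20, 0.3))                            # minimum d' for eps = 0.3
    print(max_sq_distortion(X, gaussian_projection(X, 1000)))
```

A distortion of at most $\epsilon$ in `max_sq_distortion` corresponds exactly to the two-sided inequality in the lemma. The random construction satisfies it with high probability rather than always, which is why the lemma is phrased as an existence statement.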

Figures (18)

  • Figure 1: Illustration of computing DETAIL score for transformer-based ICL. Note that we use the same notation $m_{p[\cdot]}$ before and after the random projection since the projection is optional.
  • Figure 2: (Left) Visualization of learning a label mapping for MNIST digits in context. The left $9$ images in each row are demonstrations, and the right-most one is the query image. Below each image is its mapped label ("A" to "E"). Above each ICL image is its $\mathcal{I}_{\text{test}}$ w.r.t. the query image, with high values highlighted in green and low values in red. Above the query image is the prediction (pred) made by the pre-trained transformer, shown in green if it matches the ground truth (GT) and in red otherwise. The top row shows that using all $9$ demonstrations allows the transformer to learn the mapping in context, as GT$=$pred$=$"E". The middle row shows that removing the $5$ demonstrations with the highest $\mathcal{I}_{\text{test}}$ removes most of the digit $0$s, leading to a wrong prediction. The bottom row shows that removing the $5$ demonstrations with the lowest $\mathcal{I}_{\text{test}}$ leaves $3$ digit $0$s for the transformer to learn from in context, leading to a correct prediction. (Right) Average accuracy on $1013$ ICL datasets; $\lambda=0.01$. Lines and shades represent the mean and standard error over $10$ independent trials.
  • Figure 3: ($1$st and $2$nd) Corrupting the labels of demonstrations and ($3$rd and $4$th) removing demonstrations, selected by high/low DETAIL scores ($\mathcal{I}_{\text{test}}$), on AG News. Randomly perturbing demonstrations yields accuracy between the two, as expected. All experiments are repeated for $10$ trials; $\lambda=1.0$. Lines and shades represent the mean and standard error, respectively.
  • Figure 3: Accuracy (on GPT-3.5) of demonstrations (demos) permuted randomly versus permuted based on $\mathcal{I}_{\text{self}}$. Mean and std. error (in brackets) over $80$ trials are shown.
  • Figure 4: ($1$st and $2$nd) Fraction of noisy labels identified vs. number of demonstrations checked, ranked by DETAIL (with $d' = 1000$) and by leave-one-out (LOO), on Subj using Vicuna-7b and Llama-2-13b, respectively. ($3$rd) Wall time of DETAIL vs. LOO on all datasets. ($4$th) Wall time in seconds (left $y$-axis) and AUCROC (right $y$-axis) vs. projection dimension on Subj using Vicuna-7b. All experiments are repeated for $10$ trials; $\lambda=10^{-9}$. Lines and shades represent the mean and std. error.
  • ...and 13 more figures
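The figure captions describe three downstream uses of DETAIL scores: removing the demonstrations with the highest or lowest $\mathcal{I}_{\text{test}}$ (Figures 2 and 3), reordering demonstrations by $\mathcal{I}_{\text{self}}$ (the GPT-3.5 result), and ranking demonstrations to find noisy labels, summarized by a detection curve and AUCROC (Figure 4). A minimal sketch of that bookkeeping, assuming the scores are already computed. All function names are hypothetical. The ascending default in `reorder_by_score` is an assumption, since this summary does not state which direction DETAIL uses. Treating a higher score as more suspicious for noise is likewise an assumption.

```python
import numpy as np


def remove_by_score(scores, k, highest=True):
    """Indices kept after dropping the k highest (or lowest) scores, in prompt order."""
    order = np.argsort(scores, kind="stable")            # ascending by score
    ranked = order[::-1] if highest else order
    dropped = set(ranked[:k].tolist())                   # k == 0 drops nothing
    return [i for i in range(len(scores)) if i not in dropped]


def reorder_by_score(scores, ascending=True):
    """Permutation of demonstration indices sorted by score (assumed ordering rule)."""
    order = np.argsort(scores, kind="stable")
    return order.tolist() if ascending else order[::-1].tolist()


def noisy_detection_curve(scores, is_noisy):
    """Fraction of noisy demos found after checking the top-t by score, for t = 1..n."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.cumsum(np.asarray(is_noisy, dtype=float)[order])
    return hits / max(hits[-1], 1.0)                     # avoid 0/0 if nothing is noisy


def auc_roc(scores, is_noisy):
    """AUCROC as P(noisy demo outscores clean demo), counting ties as 1/2."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(is_noisy, dtype=bool)
    pos, neg = s[y], s[~y]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.8, 0.2]          # made-up scores for illustration
    noisy = [True, False, True, False]     # made-up noise flags
    print(remove_by_score(scores, 2), reorder_by_score(scores))
    print(noisy_detection_curve(scores, noisy), auc_roc(scores, noisy))
```

Kept demonstrations preserve their original relative order, so removal and reordering stay separate operations that can be studied independently.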

Theorems & Definitions (1)

  • Theorem C.1: Johnson-Lindenstrauss Lemma