DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning
Zijian Zhou, Xiaoqiang Lin, Xinyi Xu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low
TL;DR
This work introduces DETAIL, an influence-function-based attribution method for in-context learning that treats transformers as implementing an internal kernelized regression over the demonstrations. By modeling the impact of each demonstration on a query through a closed-form, ridge-regularized kernel regression on an internal representation m(x), DETAIL enables fast, order-aware attribution and supports both self-influence and test-influence calculations. The approach uses random projections for efficiency and demonstrates practical benefits in demonstration perturbation, noisy-demonstration detection, and demonstration curation, with applications to on-device white-box LLMs and insights that transfer to black-box models such as GPT-3.5. The results indicate that DETAIL can improve ICL performance and reliability while offering interpretable, transferable attribution, highlighting its potential to guide demonstration selection and prompt design in real-world settings.
Abstract
In-context learning (ICL) allows transformer-based language models that are pre-trained on general text to quickly learn a specific task from a few "task demonstrations" without updating their parameters, significantly boosting their flexibility and generality. ICL has many characteristics that distinguish it from conventional machine learning, and therefore requires new approaches to interpret this learning paradigm. Taking the viewpoint of recent works showing that transformers learn in context by formulating an internal optimizer, we propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL. We empirically verify that our approach is effective for demonstration attribution while remaining computationally efficient. Leveraging the results, we then show how DETAIL can help improve model performance in real-world scenarios through demonstration reordering and curation. Finally, we experimentally demonstrate the wide applicability of DETAIL by showing that attribution scores obtained on white-box models are transferable to black-box models in improving model performance.
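
To make the ridge-regularized regression view from the TL;DR concrete, the following is a minimal sketch of influence-function attribution on a closed-form ridge regressor. It is not the authors' implementation: how m(x) is extracted from the transformer, how labels are encoded, and the exact sign and scaling conventions are not specified in this section, so random vectors stand in for m(x) and every function name here is hypothetical. Under the convention assumed below, a positive test-influence score means that removing the demonstration would, to first order, increase the query loss.

```python
import numpy as np


def random_projection(reps, k, seed=0):
    """Project d-dimensional representations to k dimensions (Gaussian matrix, an assumed choice)."""
    rng = np.random.default_rng(seed)
    d = reps.shape[1]
    proj = rng.normal(size=(d, k)) / np.sqrt(k)
    return reps @ proj


def ridge_fit(M, y, lam):
    """Closed-form minimizer of (1/n) * sum_i 0.5*(m_i @ beta - y_i)^2 + 0.5*lam*||beta||^2."""
    n, d = M.shape
    H = M.T @ M / n + lam * np.eye(d)  # Hessian of the regularized objective
    beta = np.linalg.solve(H, M.T @ y / n)
    return beta, H


def per_example_grads(M, y, beta):
    """Gradient of each squared loss 0.5*(m_i @ beta - y_i)^2 with respect to beta."""
    residuals = M @ beta - y
    return residuals[:, None] * M


def influence_scores(M_demo, y_demo, m_test, y_test, lam=1.0):
    """Return (test_influence, self_influence), one score per demonstration."""
    beta, H = ridge_fit(M_demo, y_demo, lam)
    G = per_example_grads(M_demo, y_demo, beta)  # shape (n, d)
    g_test = per_example_grads(m_test[None, :], np.array([y_test]), beta)[0]
    H_inv_G = np.linalg.solve(H, G.T)  # shape (d, n), avoids forming H^{-1}
    test_influence = g_test @ H_inv_G  # grad_test^T H^{-1} grad_i
    self_influence = np.einsum("nd,dn->n", G, H_inv_G)  # grad_i^T H^{-1} grad_i
    return test_influence, self_influence


# Toy usage: 8 demonstrations and 1 query, with random stand-ins for m(x).
rng = np.random.default_rng(0)
reps = rng.normal(size=(9, 64))  # last row plays the role of the query
labels = rng.choice([-1.0, 1.0], size=8)
Z = random_projection(reps, k=16)  # reduce dimension before solving
test_inf, self_inf = influence_scores(Z[:-1], labels, Z[-1], 1.0, lam=0.1)
```

In this sketch, self-influence is always non-negative because the Hessian is positive definite, which is the property that makes it usable as a ranking signal for unusual or noisy demonstrations. Projecting to k dimensions shrinks the linear system from d x d to k x k.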
