Fast Training Dataset Attribution via In-Context Learning

Milad Fotouhi; Mohammad Taha Bahadori; Oluwaseyi Feyisetan; Payman Arabshahi; David Heckerman

TL;DR

This work tackles training data attribution (TDA) for instruction-tuned LLMs by leveraging in-context learning and prompt engineering. It introduces two complementary methods: SCM, a non-parametric Shapley-context approach, and CMF, a semi-parametric context mixture model cast as a matrix-factorization problem solved via alternating projected least squares. CMF is more robust to retrieval noise and yields more reliable attribution, capturing base-model contributions and dataset-specific effects without explicit latent distributions. In experiments on BoolQ, FakeQ, and Olympic2024 across multiple models, the authors show that CMF and SCM can quantify dataset influence and evaluate unlearning techniques, with CMF offering favorable runtime and performance. The findings suggest practical uses in data curation, auditing, and data-influence assessment for real-world, retrieval-augmented systems.
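
To make "alternating projected least squares" concrete, here is a minimal sketch of the general technique on a generic factorization Y ≈ W H. It is not the authors' CMF implementation: the function name `projected_als`, the choice of nonnegativity as the projection set, and the toy data are all assumptions made for illustration, and the constraints CMF actually uses may differ.

```python
import numpy as np

def projected_als(Y, rank, n_iters=200, seed=0):
    """Approximate Y ~= W @ H with W >= 0 and H >= 0.

    Each step solves an unconstrained least-squares problem for one
    factor with the other held fixed, then projects the result onto the
    constraint set. Here the projection clips negatives to zero, which
    is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iters):
        # Fix W, solve min_H ||Y - W H||_F, then project H.
        H = np.clip(np.linalg.lstsq(W, Y, rcond=None)[0], 0.0, None)
        # Fix H, solve the transposed problem for W, then project W.
        W = np.clip(np.linalg.lstsq(H.T, Y.T, rcond=None)[0].T, 0.0, None)
    return W, H

# Toy usage: a small nonnegative matrix of exact rank 2 (synthetic data).
rng = np.random.default_rng(1)
Y = rng.random((6, 2)) @ rng.random((2, 5))
W, H = projected_als(Y, rank=2)
relative_error = np.linalg.norm(Y - W @ H) / np.linalg.norm(Y)
```

Projecting after each least-squares solve keeps every update cheap (a single linear solve), at the cost of losing the monotone-decrease guarantee that an exact constrained solver would give.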

Abstract

We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data to the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.
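
The first approach compares model outputs with and without provided context. The sketch below illustrates only that comparison step. Everything in it is hypothetical: `context_shift`, the prompt template, the stand-in `toy_model`, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions, and how the paper turns such differences into contribution scores is not described in this summary.

```python
from difflib import SequenceMatcher
from typing import Callable

def context_shift(generate: Callable[[str], str], question: str, context: str) -> float:
    """Return 1 - similarity between answers without and with context.

    0.0 means the context did not change the answer at all; larger
    values mean a larger change. `generate` is any prompt -> text function.
    """
    answer_without = generate(question)
    answer_with = generate(f"Context: {context}\n\nQuestion: {question}")
    similarity = SequenceMatcher(None, answer_without, answer_with).ratio()
    return 1.0 - similarity

def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call (hypothetical): answers only if the fact is in the prompt."""
    if "Paris hosted" in prompt:
        return "Paris"
    return "I do not know."

question = "Which city hosted the 2024 Summer Olympics?"
shift_relevant = context_shift(toy_model, question, "Paris hosted the 2024 Summer Olympics.")
shift_irrelevant = context_shift(toy_model, question, "Bananas are rich in potassium.")
```

In practice `toy_model` would be replaced by a call to the instruction-tuned model under study, and a semantic similarity measure would likely be preferable to character-level matching.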

Paper Structure (24 sections, 7 equations, 1 figure, 15 tables, 2 algorithms)

Introduction
Methodology
The Non-parametric Approach: The Shapley Context Method (SCM)
The Semi-Parametric Approach: Context Mixture Factorization (CMF)
Formulating as a Matrix Factorization Problem.
Alternating Projected Least Squares.
Implementation
Prompt Engineering
Using RAG
Experiments
Results and Analysis
Case Study: Evaluation of Unlearning Methods
Runtime Comparison
Validation of Attribution Metrics through Fine-Tuning
Deep Dive into RAG Noise Effect
...and 9 more sections

Figures (1)

Figure 1: (left) SCM Attribution Values vs. Learning Rate for Olympic2024 Dataset: Attribution values increase with fine-tuning. (right) CMF Attribution Values vs. Learning Rate for Olympic2024 Dataset: CMF shows higher attribution values, reflecting its robustness during fine-tuning.
