Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, Kellin Pelrine

TL;DR

Concept Influence attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than to individual test examples. Simple probe-based attribution methods are shown to be first-order approximations of Concept Influence that achieve comparable performance while being over an order of magnitude faster.

Abstract

As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating the influence of individual datapoints. Existing approaches such as influence functions are computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address both scalability and attribution to abstract behaviors, we leverage interpretable structures within the model during attribution. First, we introduce Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than to individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence that achieve comparable performance while being over an order of magnitude faster. We empirically validate Concept Influence and its approximations across emergent misalignment benchmarks and real post-training datasets, and demonstrate that they achieve performance comparable to classical influence functions while being substantially more scalable. More broadly, we show that incorporating interpretable structure within traditional TDA pipelines can enable more scalable and explainable attribution, and better control of model behavior through data.
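
To make attribution to a semantic direction concrete, here is a minimal sketch of a probe-style score. It assumes each training example has already been reduced to a single activation vector (for example, a mean over token positions at one layer) and that a concept direction is available from a linear probe or an SAE feature. The function names, the toy vectors, and the choice of a dot product with a unit-normalized direction are illustrative assumptions, not the paper's exact procedure.

```python
import math


def normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("concept direction must be non-zero")
    return [x / norm for x in v]


def projection_scores(activations, concept_direction):
    """Score each example by projecting its activation onto the concept direction."""
    unit = normalize(concept_direction)
    return [sum(a * u for a, u in zip(act, unit)) for act in activations]


# Toy data (hypothetical): three training examples in a 3-d activation space.
train_activations = [
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
    [-1.0, 0.5, 0.0],
]
concept = [2.0, 0.0, 0.0]  # hypothetical concept direction, e.g. from a probe

scores = projection_scores(train_activations, concept)

# Indices of training examples, most concept-aligned first.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

A score of this kind needs only forward-pass activations, not per-example gradients or Hessian approximations. That is one plausible reason such approximations could be much cheaper than classical influence functions.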

Paper Structure (33 sections, 18 equations, 16 figures, 1 table)

This paper contains 33 sections, 18 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Standard influence functions commonly attribute influence to a single output query, may fail to identify the semantic information the user is interested in, and often return data that is similar only in syntactic or other undesired ways. We propose to calculate influence to interpretable components (e.g., probe vectors or SAE features) instead. This allows users to define the target concept a priori, and yields better quantitative results and data points that are qualitatively more similar to the desired concept (e.g., "Evil"). Moreover, we show that approximations to these components can be orders of magnitude faster with competitive or better performance.
  • Figure 2: Filtering out data causing emergent misalignment (EM) and retraining. We finetune Qwen2.5-7B on four EM datasets (Misaligned Opinions, Bad Medical Advice, Insecure Code, and GSM8k Mistakes) and use an LLM judge to evaluate 'evilness' before and after. We then use four different data attribution methods to try to reduce (Remove Most) or increase (Remove Least) the evilness of the model. Across all datasets, Concept Influence performs the best, while efficient approximations achieve comparable performance to influence functions in many settings.
  • Figure 3: Correlation of influence scores between the four methods across the four emergent misalignment datasets. Broadly, we observe higher correlation within two groups, (i) vector-based methods and (ii) gradient-based methods, suggesting that two different notions of influence are being captured.
  • Figure 4: Most influential SAE features for the "evil" persona trait on the sampled test query, for Qwen2.5-7B finetuned on the Misaligned Opinions dataset. Red and blue bars indicate the total influence coming from evil and normal data, respectively. Influence Functions (top) surface generic concepts (legal terms, tax relief, culinary experiences) that are mentioned in the query but unrelated to the target trait. Concept Influence (bottom) reveals semantically relevant features (historical oppression, conspiracy theories, criminality, and societal critique) that are predominantly influenced by the evil-aligned fine-tuning data (red).
  • Figure 5: Filtering out harmful data when post-training Qwen2.5-7B on the Open Assistant v1 (OASST1) dataset [kopf2023openassistant]. Supervised finetuning on OASST1 improves instruction following, as measured on the MTBench dataset [kwan2024mt], from 38% to 67%, but also increases harmfulness scores (according to an LLM judge) by $\approx 2\%$. Efficient filtering methods (Vector Filter and Projection Difference) produce comparable results while being an order of magnitude more efficient.
  • ...and 11 more figures
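
The filtering and comparison steps described in the captions for Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: per-example attribution scores are already computed, "Remove Most" drops the highest-scoring fraction and "Remove Least" the lowest, and methods are compared with Spearman rank correlation (the captions do not say which correlation is used). All function names and toy scores are hypothetical.

```python
def remove_by_score(scores, fraction, mode="most"):
    """Return indices of examples kept after dropping a fraction by score.

    mode="most" drops the highest-scoring examples (Remove Most);
    mode="least" drops the lowest-scoring examples (Remove Least).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if mode not in ("most", "least"):
        raise ValueError("mode must be 'most' or 'least'")
    n_drop = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=(mode == "most"))
    dropped = set(order[:n_drop])
    return [i for i in range(len(scores)) if i not in dropped]


def ranks(values):
    """Rank positions 0..n-1 in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two tie-free score lists of equal length."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy scores (hypothetical) from three attribution methods on four examples.
scores_a = [0.9, 0.1, 0.5, -0.3]
scores_b = [0.8, 0.2, 0.6, -0.1]    # same ordering as scores_a
scores_c = [-0.9, -0.1, -0.5, 0.3]  # reversed ordering

kept_most = remove_by_score(scores_a, 0.25, mode="most")
kept_least = remove_by_score(scores_a, 0.25, mode="least")
agreement = spearman(scores_a, scores_b)
disagreement = spearman(scores_a, scores_c)
```

The kept indices would then select the subset of the fine-tuning set used for retraining. The retraining and LLM-judge evaluation steps are outside the scope of this sketch.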