Influence-based Attributions can be Manipulated

Chhavi Yadav; Ruihan Wu; Kamalika Chaudhuri

TL;DR

This work presents realistic incentives to manipulate influence-based attributions and investigates whether an adversary can tamper with these attributions. It shows that this is indeed possible for logistic regression models trained on ResNet feature embeddings and on standard tabular fairness datasets.

Abstract

Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be systematically tampered with by an adversary. We show that this is indeed possible for logistic regression models trained on ResNet feature embeddings and on standard tabular fairness datasets, and we provide efficient attacks with backward-friendly implementations. Our work raises questions about the reliability of influence-based attributions in adversarial circumstances. Code is available at https://github.com/infinite-pursuits/influence-based-attributions-can-be-manipulated
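
To make the object under attack concrete, the following is a minimal sketch of how influence scores could be computed for a logistic regression model. It is not taken from the paper's repository. It assumes the common Koh and Liang convention, score(z) = -g_test^T H^{-1} g(z), with the test gradient summed over the test set; the paper's exact sign and normalization may differ. The names `fit_logreg`, `influence_scores`, and the `l2` parameter are hypothetical.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logreg(X, y, l2=1e-2, steps=50):
    """Fit L2-regularized logistic regression (labels in {0, 1}) with Newton's method."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n + l2 * theta
        hess = (X.T * (p * (1 - p))) @ X / n + l2 * np.eye(d)
        theta -= np.linalg.solve(hess, grad)
    return theta


def influence_scores(X_train, y_train, X_test, y_test, theta, l2=1e-2):
    """Influence of each training sample on the total test loss.

    Assumed convention (Koh and Liang style): -g_test^T H^{-1} g_train_i.
    Per-sample gradients omit the regularizer; this is an assumption.
    """
    n, d = X_train.shape
    p_train = sigmoid(X_train @ theta)
    # Hessian of the regularized mean training loss, shape (d, d)
    hess = (X_train.T * (p_train * (1 - p_train))) @ X_train / n + l2 * np.eye(d)
    # Per-sample training loss gradients, shape (n, d)
    g_train = (p_train - y_train)[:, None] * X_train
    # Gradient of the summed test loss, shape (d,)
    g_test = X_test.T @ (sigmoid(X_test @ theta) - y_test)
    return -g_train @ np.linalg.solve(hess, g_test)
```

Ranking the training samples by these scores, for example with `np.argsort(-scores)`, gives the kind of influence ranking that a data valuation application would consume and that an adversarial model trainer would try to shift.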

Paper Structure

This paper contains 17 sections, 1 theorem, 11 equations, 8 figures, 4 tables, 3 algorithms.

Introduction
Preliminaries
General Threat Model
Downstream Application 1: Data Valuation
Data Valuation Experiments
Downstream Application 2: Fairness
Fairness Manipulation Experiments
Discussion on Susceptibility and Defense
Related Work
Conclusion & Future Work
Appendix
Auditing the Influence Calculator by Supplying Test Data
Data Manipulation Attack Details
Efficient Backward Pass Algorithm
Experimental details
...and 2 more sections

Key Result

Theorem 1

For a logistic regression family of models and any target influence ranking $k\in\mathbb{N}$, there exists a training set $Z_{\rm train}$, test set $Z_{\rm test}$ and target sample $z_{\rm target} \in Z_{\rm train}$, such that no model in the family can have the target sample $z_{\rm target}$ in top

Figures (8)

Figure 1: Threat Model. Data Provider provides training data. Influence Calculator trains a model and computes influence scores for the training data on the trained model and a test set. It outputs both the trained model and the resulting influence scores, which are used for a downstream application such as data valuation or fairness. Adversarial manipulation happens in the model training process, which trains a malicious model to achieve desired influence scores, while maintaining similar accuracy as the honest model.
Figure 2: Behavior and Transfer results for Single-Target Attack in the Data Valuation use-case. Value of manipulation radius $C$ (Eq. \ref{eq:att1}) increases from left to right in each curve. (1) Behavior on original test set (solid lines): As manipulation radius $C$ increases, manipulated model accuracy drops while attack success rate increases. (2) Transfer on an unknown test set (dashed lines): Success rate on an unknown test set gets better with increasing values of ranking $k$.
Figure 3: Performance of Multi-Target Attack in the Data Valuation use-case. Results for the high-accuracy regime. Success Rates are higher when target set size is greater than the desired ranking $k$.
Figure 4: Histograms of original ranks for easy-to-manipulate samples (L) and for hard-to-manipulate samples (M); scatterplot of influence gradient norm vs. original rank for 50 random target samples (R). Ranking $k:=1$. For other datasets, see App. Fig. \ref{app:fig:easyvshard}.
Figure 5: Scaling attack for the Fairness use-case. Demographic Parity Gap of post-attack downstream models is higher than that of models without the attack, while test accuracies are comparable. This implies that post-attack downstream models are less fair than those without the attack. Scaling coefficients are shown in log scale.
...and 3 more figures

Theorems & Definitions (3)

Definition 1: Influence Function (Koh and Liang, 2017)
Theorem 1
Definition 2: Demographic Parity Gap (DP) (Dwork et al., 2012)
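
For readers unfamiliar with the fairness metric named in Definition 2, here is a minimal sketch of the Demographic Parity Gap under its usual formulation: the absolute difference in positive-prediction rates between two groups. The paper's exact statement is not reproduced on this page, so the binary-group assumption and the function name `demographic_parity_gap` are illustrative only.

```python
import numpy as np


def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rate between two groups.

    Assumes binary predictions in {0, 1} and a binary group indicator in {0, 1},
    with at least one sample in each group.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group)
    rate_group1 = y_pred[group == 1].mean()
    rate_group0 = y_pred[group == 0].mean()
    return abs(rate_group1 - rate_group0)


# Group 1 receives positive predictions 75% of the time, group 0 only 25%.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]
gap = demographic_parity_gap(preds, groups)  # 0.5
```

A gap of 0 means both groups receive positive predictions at the same rate; larger values indicate a less fair model under this metric.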
