Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors

Peiyu Yang, Naveed Akhtar, Jiantong Jiang, Ajmal Mian

Abstract

The performance of neural network models deteriorates due to their unreliable behavior on non-robust features of corrupted samples. Owing to the opaque nature of these models, rectifying them often necessitates arduous data cleaning and model retraining, resulting in huge computational and manual overhead. In this work, we leverage rank-one model editing to establish an attribution-guided model rectification framework that effectively locates and corrects unreliable model behaviors. We first distinguish our rectification setting from existing model editing, yielding a formulation that corrects unreliable behavior while preserving model performance and reducing reliance on large budgets of cleansed samples. We further reveal a bottleneck in model rectification arising from heterogeneous editability across layers. To target the primary source of misbehavior, we introduce an attribution-guided layer localization method that quantifies layer-wise editability and identifies the layer most responsible for the unreliable behavior. Extensive experiments demonstrate the effectiveness of our method in correcting unreliable behaviors caused by neural Trojans, spurious correlations, and feature leakage. Our method achieves its editing objective with as few as a single cleansed sample, which makes it appealing in practice.

Paper Structure (30 sections, 5 theorems, 22 equations, 12 figures, 10 tables, 1 algorithm)

Key Result

Lemma 1

For $K = [k_1,\ldots,k_d] \in \mathbb{R}^{n\times d}$ and $C = KK^{\top}$, when $k^* \notin \operatorname{span}(K)$, the projection $C^{-1}k^*$ leads to a residual component $C^{-1}r$ outside the span of $K$, measurable by a residual vector $r\in \mathbb{R}^n$.
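
The decomposition behind Lemma 1 can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's construction: it splits a new key $k^*$ into a part inside $\text{span}(K)$ and an orthogonal residual $r$, then shows how $r$ passes through the inverse. Because $C = KK^{\top}$ is singular whenever some vector lies outside $\text{span}(K)$, the code assumes a small ridge term `lam` (hypothetical, not stated in this excerpt) so that the inverse exists. The sizes `n`, `d` and the random data are also illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3                      # key dimension n, number of stored keys d (d < n)
K = rng.standard_normal((n, d))  # columns are the keys k_1, ..., k_d
C = K @ K.T                      # rank at most d, so singular when d < n

k_star = rng.standard_normal(n)  # generic new key, almost surely outside span(K)

# Orthogonal decomposition k* = K @ alpha + r, with r orthogonal to span(K).
alpha, *_ = np.linalg.lstsq(K, k_star, rcond=None)
r = k_star - K @ alpha

# Assumption: a ridge term makes the inverse well defined.
lam = 1e-2
C_reg = C + lam * np.eye(n)
proj = np.linalg.solve(C_reg, k_star)    # (C + lam*I)^{-1} k*
resid_part = np.linalg.solve(C_reg, r)   # (C + lam*I)^{-1} r

print("residual norm:", np.linalg.norm(r))
print("residual orthogonal to keys:", np.allclose(K.T @ r, 0.0, atol=1e-9))
```

Since $K^{\top} r = 0$ implies $Cr = 0$, the regularized inverse maps $r$ to $r/\lambda$ in this sketch, so the out-of-span part of $k^*$ is carried into the projection unchanged in direction and is amplified as `lam` shrinks.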

Figures (12)

  • Figure 1: Given the original sample labeled as Agama, i.e., class $\textbf{y}$, the Trojaned model can correctly classify this sample. However, it misclassifies the poisoned sample containing a trigger as Tench, i.e., class $\tilde{\textbf{y}}$. Attribution maps with Pearson Correlation Coefficients (PCCs) and predictive confidence for the vanilla model, fine-tuned model, and model rectified with our approach are provided. Our method restores the correct label by assigning appropriate attributions to the correct object.
  • Figure 2: False confidence reduction rank after individually rectifying different layers of ResNet-18. A lower value indicates better results.
  • Figure 3: Model rectification workflow. Step 1: Given a pair of clean and corrupted samples, map their prediction attributions for all internal layers. Step 2: Transform the attributions to emphasize editable parameters and locate the suspect layer. Step 3: Apply rank-one model editing to establish a new key-value association in the located layer for behavior correction.
  • Figure 4: Comparison of model performance between fine-tuned models (FT) and rectified models (Ours). (a) Mitigation of false confidence as a function of the number of samples used. (b) Mitigation of false confidence as a function of the overall accuracy degradation (%) during model rectification and fine-tuning. Results are computed for ResNet-18 on the CIFAR-10 dataset.
  • Figure 5: BlockMNIST and feature leakage. (a) A null block is randomly appended at the top or bottom of MNIST samples. (b, c) Integrated gradients estimated on the benign model and on our rectified model.
  • ...and 7 more figures
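
Step 3 of the workflow in Figure 3 establishes a new key-value association through a rank-one edit. A minimal sketch of the closed-form rank-one update commonly used in the model-editing literature is given below. The paper's exact formulation is not shown in this excerpt, so treat this as an illustration only. The function name `rank_one_edit`, the layer sizes, the random data, and the small ridge term added to $C$ are all assumptions.

```python
import numpy as np

def rank_one_edit(W, C, k_star, v_star):
    """Return W' = W + (v* - W k*) (C^{-1} k*)^T / (k*^T C^{-1} k*).

    W      : (m, n) weight matrix of the layer being edited
    C      : (n, n) key covariance K K^T (assumed invertible here)
    k_star : (n,) key that should map to the new value
    v_star : (m,) desired output for k_star
    """
    c_inv_k = np.linalg.solve(C, k_star)   # C^{-1} k*
    residual = v_star - W @ k_star         # what the layer currently gets wrong
    return W + np.outer(residual, c_inv_k) / (k_star @ c_inv_k)

rng = np.random.default_rng(0)
n, m, d = 8, 5, 64                      # input dim, output dim, number of keys
K = rng.standard_normal((n, d))         # previously seen keys (illustrative)
C = K @ K.T + 1e-3 * np.eye(n)          # assumed ridge keeps C well conditioned
W = rng.standard_normal((m, n))

k_star = rng.standard_normal(n)         # key derived from the cleansed sample
v_star = rng.standard_normal(m)         # value encoding the corrected behavior

W_new = rank_one_edit(W, C, k_star, v_star)
print("edit satisfied:", np.allclose(W_new @ k_star, v_star))
print("rank of update:", np.linalg.matrix_rank(W_new - W))
```

Weighting the update direction by $C^{-1}$ is what keeps the change small on previously seen keys, which is the intuition behind preserving overall model performance while inserting the new association.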

Theorems & Definitions (10)

  • Lemma 1: Out-of-Span Residual
  • Lemma 2: Sample Complexity
  • Proposition 4.1: Rectifiability
  • Proposition 4.2: Span-Aligned Control
  • Lemma 3
  • Proof of Lemma 1
  • Proof of Lemma 2
  • Proof of Proposition 4.1
  • Proof of Proposition 4.2
  • Proof of Lemma 3