Bridging Interpretability and Robustness Using LIME-Guided Model Refinement

Navid Nayyem, Abdullah Rakin, Longwei Wang

TL;DR

This work addresses the intertwined problems of interpretability and robustness in deep CNNs by introducing a LIME-guided refinement framework. It treats LIME explanations as an active intervention signal, pinpointing spurious dependencies and guiding iterative model refinements through feature masking, sensitivity regularization, and adversarial training. Empirical results on CIFAR-10, CIFAR-100, and CIFAR-10C show that the refined models achieve substantially improved adversarial robustness and out-of-distribution generalization, with a modest trade-off in clean accuracy. The approach demonstrates a practical pathway to more resilient and transparent neural networks by leveraging localized explanations to steer robust feature learning.

Abstract

This paper explores the intricate relationship between interpretability and robustness in deep learning models. Despite their remarkable performance across various tasks, deep learning models often exhibit critical vulnerabilities, including susceptibility to adversarial attacks, over-reliance on spurious correlations, and a lack of transparency in their decision-making processes. To address these limitations, we propose a novel framework that leverages Local Interpretable Model-Agnostic Explanations (LIME) to systematically enhance model robustness. By identifying and mitigating the influence of irrelevant or misleading features, our approach iteratively refines the model, penalizing reliance on these features during training. Empirical evaluations on multiple benchmark datasets demonstrate that LIME-guided refinement not only improves interpretability but also significantly enhances resistance to adversarial perturbations and generalization to out-of-distribution data.
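The idea of penalizing reliance on misleading features during training can be illustrated with a small sketch. This is not the authors' implementation. It assumes a logistic-regression model in place of a CNN, a hand-supplied binary `spurious_mask` standing in for the features a LIME explanation would flag, and a squared input-gradient penalty with a hypothetical weight `lam`; all function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def input_gradient(w, b, x):
    """Gradient of the predicted probability with respect to each input feature."""
    p = predict(w, b, x)
    return [p * (1.0 - p) * wi for wi in w]


def refined_loss(w, b, x, y, spurious_mask, lam):
    """Cross-entropy plus a penalty on sensitivity to flagged features.

    spurious_mask: 1 for features assumed to be flagged as spurious
                   (for example by a LIME explanation), 0 otherwise.
    lam:           hypothetical penalty weight; lam = 0 recovers cross-entropy.
    """
    p = predict(w, b, x)
    cross_entropy = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grads = input_gradient(w, b, x)
    penalty = sum(m * g * g for m, g in zip(spurious_mask, grads))
    return cross_entropy + lam * penalty
```

With `lam > 0`, the loss grows whenever the prediction is sensitive to a flagged feature, so minimizing it pushes the model away from those features while leaving unflagged ones unconstrained.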

Paper Structure

This paper contains 40 sections, 12 equations, 15 figures.

Figures (15)

  • Figure 1: Iterative Steps in LIME-Guided Model Refinement
  • Figure 2: Accuracy Over Epochs for the LIME CIFAR-10 Dataset. This plot compares the accuracy of the baseline and refined models across training epochs. The refined model consistently outperforms the baseline, demonstrating the effectiveness of the refinements.
  • Figure 3: Loss Over Epochs for the LIME CIFAR-10 Dataset. This plot shows the loss trends of the baseline and refined models during training. The refined model achieves a significantly lower loss than the baseline, indicating better convergence.
  • Figure 4: Comparison of Baseline and Refined Accuracy under FGSM Attack on the CIFAR-10 Dataset. The x-axis represents the epsilon values, and the y-axis shows the corresponding accuracies. The refined model exhibits superior robustness across all epsilon values, with a notable improvement at lower perturbation levels.
  • Figure 5: Comparison of Baseline and Refined Accuracy under PGD Attack on the CIFAR-10 Dataset. This plot highlights the refined model's enhanced resilience to adversarial noise, with significant improvements in accuracy compared to the baseline across all epsilon values.
  • ...and 10 more figures
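
For readers unfamiliar with the attack evaluated in Figure 4, FGSM perturbs an input by a step of size epsilon in the direction of the sign of the loss gradient with respect to that input. The following is a minimal sketch, assuming a logistic-regression model whose input gradient has a closed form (the paper evaluates CNNs); the function names are illustrative and not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to x: (p - y) * w."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, epsilon):
    """One-step FGSM: x_adv = x + epsilon * sign(grad_x loss)."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]
```

Larger epsilon values allow larger perturbations, which is why accuracy is plotted against epsilon. For image inputs, the perturbed values are usually also clipped back to the valid pixel range, a step omitted here.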