Table of Contents
Fetching ...

DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models

Zijian Zhang, Vinay Setty, Yumeng Wang, Avishek Anand

TL;DR

DISCO is introduced, a novel method for discovering global, rule-based explanations by identifying causal n-gram associations with model predictions by employing a scalable sequence mining technique to extract relevant text spans from training data, associate them with model predictions, and conduct causality checks to distill robust rules that elucidate model behavior.

Abstract

With the rapid advancement of neural language models, the deployment of over-parameterized models has surged, increasing the need for interpretable explanations comprehensible to human inspectors. Existing post-hoc interpretability methods, which often focus on unigram features of single input textual instances, fail to capture the models' decision-making process fully. Additionally, many methods do not differentiate between decisions based on spurious correlations and those based on a holistic understanding of the input. Our paper introduces DISCO, a novel method for discovering global, rule-based explanations by identifying causal n-gram associations with model predictions. This method employs a scalable sequence mining technique to extract relevant text spans from training data, associate them with model predictions, and conduct causality checks to distill robust rules that elucidate model behavior. These rules expose potential overfitting and provide insights into misleading feature combinations. We validate DISCO through extensive testing, demonstrating its superiority over existing methods in offering comprehensive insights into complex model behaviors. Our approach successfully identifies all shortcuts manually introduced into the training data (100% detection rate on the MultiRC dataset), resulting in an 18.8% regression in model performance -- a capability unmatched by any other method. Furthermore, DISCO supports interactive explanations, enabling human inspectors to distinguish spurious causes in the rule-based output. This alleviates the burden of abundant instance-wise explanations and helps assess the model's risk when encountering out-of-distribution (OOD) data.

DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models

TL;DR

DISCO is introduced, a novel method for discovering global, rule-based explanations by identifying causal n-gram associations with model predictions by employing a scalable sequence mining technique to extract relevant text spans from training data, associate them with model predictions, and conduct causality checks to distill robust rules that elucidate model behavior.

Abstract

With the rapid advancement of neural language models, the deployment of over-parameterized models has surged, increasing the need for interpretable explanations comprehensible to human inspectors. Existing post-hoc interpretability methods, which often focus on unigram features of single input textual instances, fail to capture the models' decision-making process fully. Additionally, many methods do not differentiate between decisions based on spurious correlations and those based on a holistic understanding of the input. Our paper introduces DISCO, a novel method for discovering global, rule-based explanations by identifying causal n-gram associations with model predictions. This method employs a scalable sequence mining technique to extract relevant text spans from training data, associate them with model predictions, and conduct causality checks to distill robust rules that elucidate model behavior. These rules expose potential overfitting and provide insights into misleading feature combinations. We validate DISCO through extensive testing, demonstrating its superiority over existing methods in offering comprehensive insights into complex model behaviors. Our approach successfully identifies all shortcuts manually introduced into the training data (100% detection rate on the MultiRC dataset), resulting in an 18.8% regression in model performance -- a capability unmatched by any other method. Furthermore, DISCO supports interactive explanations, enabling human inspectors to distinguish spurious causes in the rule-based output. This alleviates the burden of abundant instance-wise explanations and helps assess the model's risk when encountering out-of-distribution (OOD) data.

Paper Structure

This paper contains 19 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 3: Rules' agreement scores with LIME and ExPred. For MultiRC, we only consider patterns mined from its documents, excluding those from its queries.
  • Figure 4: Results of RQ2 on four datasets under different contamination-bias settings. Each column corresponds to a specific dataset. The heat maps in the first row depict the retention rate. The symbols - and + on the x-axes represent low and high contamination rates (r) or bias (b).
  • Figure 5: Fleiss' $\kappa$ among human evaluators considering whether the rules are right for the wrong reasons