Comparing Feature Importance and Rule Extraction for Interpretability on Text Data
Gianluigi Lopardo, Damien Garreau
TL;DR
This work tackles interpretability in natural language processing by comparing two popular local explanation methods: LIME, a perturbation-based feature-importance approach, and Anchors, a rule-based technique that yields simple, high-precision explanations. The authors adapt both methods to text data under TF-IDF representations and introduce a quantitative metric, the $\ell_E$-index, based on Jaccard similarity to assess how well explanations recover the most influential words. Through qualitative and quantitative experiments on sentiment classification with simple models (e.g., small decision trees and sparse logistic models), they show that LIME and Anchors can produce substantially different explanations, with LIME generally better at capturing multiple informative words and Anchors often relying on single-word anchors influenced by word multiplicities. The study highlights the method-dependent nature of explanations, provides a concrete evaluation tool for cross-method comparison, and suggests practical guidance for selecting explainers in text applications.
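To make the Jaccard-based comparison concrete, the following minimal sketch computes the plain Jaccard similarity between the set of words an explainer highlights and a reference set of influential words. It is an illustration only: the summary above does not give the exact definition of the index, so the function `jaccard`, the word sets, and the convention for two empty sets are all assumptions made for this example, not the authors' implementation.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical word sets, invented purely for illustration.
influential = {"great", "love", "excellent"}  # words the model truly relies on
lime_top = {"great", "love", "movie"}         # top-3 words by importance score
anchor = {"great"}                            # a single-word rule

lime_score = jaccard(lime_top, influential)   # 2 shared words out of 4 distinct
anchor_score = jaccard(anchor, influential)   # 1 shared word out of 3 distinct

print(f"LIME-style explanation:   {lime_score:.3f}")
print(f"Anchor-style explanation: {anchor_score:.3f}")
```

In this toy setting, the single-word rule scores lower simply because it can overlap with at most one of the three reference words, which mirrors the observation that single-word anchors tend to miss additional informative words.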
Abstract
Complex machine learning algorithms are used more and more often in critical tasks involving text data, leading to the development of interpretability methods. Among local methods, two families have emerged: those computing importance scores for each feature and those extracting simple logical rules. In this paper we show that using different methods can lead to unexpectedly different explanations, even when applied to simple models for which we would expect qualitative coincidence. To quantify this effect, we propose a new approach to compare explanations produced by different methods.
