Evaluating Saliency Explanations in NLP by Crowdsourcing

Xiaotian Lu, Jiyi Li, Zhen Wan, Xiaofeng Lin, Koh Takeuchi, Hisashi Kashima

TL;DR

This work addresses the evaluation of saliency explanations in NLP by introducing a crowdsourced, human-centered methodology. It systematically compares seven saliency methods on two datasets (IMDB and AGNEWS) using top-k word explanations and majority voting to measure interpretability from human judgments. Integrated Gradients emerges as the strongest method for aligning with human explanations, while some automated metrics fail to predict human-perceived usefulness, and a Flip phenomenon reveals counterintuitive effects when more words are shown. The study highlights the importance of human evaluation in interpretability research, provides instance-level crowd data and reproducible code, and points to future work on mitigating Flip and extending beyond a single model or dataset.

Abstract

Deep learning models have performed well on many NLP tasks. However, their internal mechanisms are typically difficult for humans to understand. The development of methods to explain models has become a key issue in the reliability of deep learning models in many important applications. Various saliency explanation methods, which assign each input feature a score proportional to its contribution to the output, have been proposed to determine the parts of the input that a model values most. Despite a considerable body of work on the evaluation of saliency methods, whether the results of various evaluation metrics agree with human cognition remains an open question. In this study, we propose a new human-based method to evaluate saliency methods in NLP by crowdsourcing. We recruited 800 crowd workers and empirically evaluated seven saliency methods on two datasets with the proposed method. We analyzed the performance of saliency methods, compared our results with existing automated evaluation methods, and identified notable differences between the NLP and computer vision (CV) fields when using saliency methods. The instance-level data of our crowdsourced experiments and the code to reproduce the explanations are available at https://github.com/xtlu/lreccoling_evaluation.

Paper Structure (11 sections, 2 equations, 4 figures, 9 tables)


Figures (4)

  • Figure 1: Example of a real task for crowd workers. The top-$10$ words of a negative review given by a saliency method are shown. Each "." corresponds to a hidden word. All punctuation and special tokens of the original text are ignored. Workers were asked to choose whether the review was likely to be negative, positive, or indeterminate based on the words shown. In this example, because the word "Poor" is included, the review may be inferred to be negative. Workers may obtain additional information or make inferences based on the positions of words, which is consistent with the position encoding of the model. The "." placeholders keep words that are far apart in the original text from appearing visually adjacent, so the display does not mislead workers.
  • Figure 2: An example (Lu et al., 2021) of the top-$5\%$ / $10\%$ / $20\%$ important pixels given by the vision saliency method GradCAM (Selvaraju et al., 2017). Regardless of the saliency method, once humans can recognize that an image with fewer pixels shows a cat, it is almost impossible for them to fail to recognize the same cat image with more pixels. However, there are two exceptions. The first is that the image is ambiguous. For example, in the cat/dog classification problem, an image may include both a cat and a dog. The second, which is rarer, is that the positions of the important pixels can themselves trace the outline of an object, such as the outline of a cat's head against a gray sky (Gupta et al., 2022).
  • Figure 3: Histograms of Flips in the two datasets. The horizontal axis shows the number of saliency methods under which a sample Flipped, and the vertical axis shows the number of samples. For example, over 30 of the 100 samples in the AGNEWS dataset did not Flip under any of the $8$ saliency methods (including the random baseline). Regardless of the saliency method, some samples are easier to Flip, i.e., they contain more misleading words.
  • Figure 4: An example of the top five important words given by seven different saliency methods for a negative movie review in the IMDB dataset. A mark above a word indicates that a saliency method ranks it among its top five important words; marks with different colors and shapes represent different saliency methods. For example, the top five important words given by LIME are "sense", "woman", "for", "about", and "couple". The word "dull" is selected by five saliency methods. Even though some selected words coincide, the saliency methods are inconsistent with each other.
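
The crowd task in Figure 1 and the majority-vote aggregation mentioned in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `mask_top_k` and `majority_vote`, the tie-breaking rules, and the toy tokens and scores are all assumptions made for illustration. The sketch also assumes that punctuation and special tokens have already been removed from `tokens`.

```python
from collections import Counter


def mask_top_k(tokens, scores, k, hidden="."):
    """Keep the k highest-scoring tokens in place; replace all others with `hidden`.

    Hypothetical helper: ties are broken in favor of the earlier token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Rank positions by descending score, then by position for ties.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    shown = set(ranked[:k])
    # Hidden words keep their slot, so shown words stay at their original positions.
    return [tok if i in shown else hidden for i, tok in enumerate(tokens)]


def majority_vote(labels, fallback="indeterminate"):
    """Return the most frequent worker label.

    Assumption: an empty list or a tie for first place yields `fallback`.
    """
    counts = Counter(labels).most_common()
    if not counts:
        return fallback
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


# Toy example (invented tokens and scores, not taken from the paper's data).
tokens = ["Poor", "acting", "and", "a", "dull", "plot"]
scores = [0.9, 0.2, 0.0, 0.0, 0.7, 0.1]
display = mask_top_k(tokens, scores, k=2)
print(" ".join(display))

worker_labels = ["negative", "negative", "positive"]
print(majority_vote(worker_labels))
```

Replacing hidden words in place, rather than dropping them, mirrors the caption's point that the "." placeholders preserve the relative positions of the shown words.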