Evaluating Webcam-based Gaze Data as an Alternative for Human Rationale Annotations

Stephanie Brandl, Oliver Eberle, Tiago Ribeiro, Anders Søgaard, Nora Hollenstein

TL;DR

Gaze data offers valuable linguistic insights that could be leveraged to infer task difficulty, and it yields a ranking of explainability methods comparable to that obtained with human rationales.

Abstract

Rationales in the form of manually annotated input spans usually serve as ground truth when evaluating explainability methods in NLP. They are, however, time-consuming to collect and often biased by the annotation process. In this paper, we examine whether human gaze, in the form of webcam-based eye-tracking recordings, is a valid alternative when evaluating importance scores. We evaluate the additional information provided by gaze data, such as total reading times, gaze entropy, and decoding accuracy with respect to human rationale annotations. We compare WebQAmGaze, a multilingual dataset for information-seeking QA, with attention and explainability-based importance scores for 4 different multilingual Transformer-based language models (mBERT, distil-mBERT, XLMR, and XLMR-L) and 3 languages (English, Spanish, and German). Our pipeline can easily be applied to other tasks and languages. Our findings suggest that gaze data offers valuable linguistic insights that could be leveraged to infer task difficulty, and that it yields a ranking of explainability methods comparable to that of human rationales.

Paper Structure (34 sections, 9 figures, 1 table)

Figures (9)

  • Figure 1: One sample from the WebQAmGaze corpus with ground-truth rationales, average eye-tracking pattern across participants and model-based relevance scores computed with LRP based on mBERT. The correct answer is shown in the rationale (upper). We see that both the gaze pattern and the model-based explanation scores focus on the first part of the answer more than on the second.
  • Figure 2: Toy example to visualize decoding accuracies (ROC-AUC scores) of ground-truth rationales for three different eye-tracking patterns (v1-v3). The correct ranking, as in v1, leads to a perfect score of $1$. In v2, only one of the correct tokens (Narges) appears in the top-2 of the reading patterns, which leads to a lower ROC-AUC score as shown on the right; similarly for v3, where the relevant tokens only appear within the top-5. For the analysis with real gaze patterns, we use only one pattern per text in each set after averaging across participants.
  • Figure 3: Entropy and decoding accuracy, separated by language. Medians are displayed within the boxplots as a straight line, whereas means are shown as white dots. Data have been filtered by WebGazer accuracy with a threshold of 20% (orange); wrong answers were additionally removed (purple).
  • Figure 4: ROC-AUC scores for decoding rationales from attention-based and gradient-based model explanations, i.e., decoding accuracies, across all 3 languages. Results for Gaze are model-agnostic. Individual samples with an F1 score below 50 have been filtered out per model and language.
  • Figure 5: Comparison of gaze-based and rationale-based ranking of explanation methods for English (EN), Spanish (ES), and German (DE) -- top to bottom. Ranks 1 to 5 indicate model explanations most to least aligned with human importance scores. Spearman rank correlation $r_s$ at $p\leq0.01$ ($^{**}$), $p\leq0.05$ ($^{*}$), or not significant (ns). Results are based on text samples filtered by correct human answers.
  • ...and 4 more figures
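
The decoding accuracy described in the caption of Figure 2 can be illustrated with a minimal sketch. It assumes that rationales are given as binary token labels and that a gaze pattern is one relevance score per token (for example, a total reading time averaged across participants). The function name `roc_auc`, the six-token text, and all scores below are hypothetical and are not taken from the paper or from WebQAmGaze; ROC-AUC is computed here with the standard pairwise formulation rather than with the authors' pipeline.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (rationale, non-rationale) token pairs
    in which the rationale token receives the higher score; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one rationale and one non-rationale token")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical six-token text: tokens 1 and 2 form the annotated rationale.
rationale = [0, 1, 1, 0, 0, 0]

# Hypothetical gaze-based scores, one value per token.
gaze_v1 = [0.10, 0.90, 0.80, 0.20, 0.30, 0.05]  # both rationale tokens ranked top-2
gaze_v2 = [0.85, 0.90, 0.30, 0.20, 0.10, 0.05]  # only one rationale token in the top-2
gaze_v3 = [0.90, 0.30, 0.20, 0.80, 0.70, 0.05]  # rationale tokens only within the top-5

for name, gaze in [("v1", gaze_v1), ("v2", gaze_v2), ("v3", gaze_v3)]:
    print(name, roc_auc(rationale, gaze))
# v1 1.0
# v2 0.875
# v3 0.25
```

The same function could be applied to model-based importance scores in place of gaze scores, which is one way to place both kinds of scores on a common scale against the rationale annotations.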