Mean Opinion Score as a New Metric for User-Evaluation of XAI Methods

Hyeon Yu, Jenny Benois-Pineau, Romain Bourqui, Romain Giot, Alexey Zhukov

TL;DR

This work addresses the evaluation gap for XAI explanations by introducing Mean Opinion Score (MOS) as a user-centric metric. It adapts MOS from image quality assessment and designs a psycho-visual protocol using distorted images and three feature-attribution explainers (Grad-CAM, FEM, MLFEM) with a ResNet-50 backbone on SALICON-derived data. MOS differentiates the explainers: MLFEM performs best on well-classified images, while Grad-CAM performs relatively better on poorly classified ones. Correlations between MOS and the automatic metrics IAUC and DAUC are only moderate. The study shows that user-centered evaluation can diverge from automatic metrics, which underscores the need for multi-faceted evaluation of XAI explanations and for further work to align disparate metrics.

Abstract

This paper investigates the use of Mean Opinion Score (MOS), a common image quality metric, as a user-centric evaluation metric for post-hoc XAI explainers. To measure MOS, a user experiment is proposed and conducted with explanation maps of intentionally distorted images. Three feature attribution methods, Gradient-weighted Class Activation Mapping (Grad-CAM), Multi-Layered Feature Explanation Method (MLFEM), and Feature Explanation Method (FEM), are compared using this metric. Additionally, the correlation of this new user-centric metric with automatic metrics is studied via Spearman's rank correlation coefficient. The MOS of MLFEM shows the highest correlation with the automatic metrics Insertion Area Under Curve (IAUC) and Deletion Area Under Curve (DAUC). However, the overall correlations are limited, which highlights the lack of consensus between automatic and user-centric metrics.

Paper Structure

This paper contains 26 sections, 2 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Weakly and strongly distorted images from the SALICON dataset: additive Gaussian noise (left), Gaussian blur (center), and uniform random brightness shift (right).
  • Figure 2: Evaluation screen. From left to right: distorted image, importance heat map, ground truth class labels, model prediction result, and Likert scale for explanation quality.
  • Figure 3: Box plots of the individual participants' MOS in each group. Box plots of outlier participants are marked with red circles. The first row shows the offline group and the second row shows the online group. From left to right: Grad-CAM, MLFEM, FEM.
  • Figure 4: MOS and the standard deviation of opinion scores of poorly classified and well-classified images. X-axis: map number, Y-axis: MOS.