REVEX: A Unified Framework for Removal-Based Explainable Artificial Intelligence in Video

F. Xavier Gaya-Morey, Jose M. Buades-Rubio, I. Scott MacKenzie, Cristina Manresa-Yee

TL;DR

REVEX extends fine-grained explanation frameworks for computer vision to video, adapting six existing removal-based techniques by adding temporal information and local explanations. It also examines the limitations of the XAI evaluation metrics employed and their suitability for different applications.

Abstract

We developed REVEX, a removal-based video explanations framework. This work extends fine-grained explanation frameworks for computer vision data and adapts six existing techniques to video by adding temporal information and local explanations. The adapted methods were evaluated across networks, datasets, image classes, and evaluation metrics. By decomposing explanation into steps, strengths and weaknesses were revealed in the studied methods, for example, on pixel clustering and perturbations in the input. Video LIME outperformed other methods with deletion values up to 31% lower and insertion up to 30% higher, depending on method and network. Video RISE achieved superior performance in the average drop metric, with values 10% lower. In contrast, localization-based metrics revealed low performance across all methods, with significant variation depending on network. Pointing game accuracy reached 53%, and IoU-based metrics remained below 20%. Drawing on the findings across XAI methods, we further examine the limitations of the employed XAI evaluation metrics and highlight their suitability in different applications.
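
The deletion and insertion scores quoted in the abstract are not defined in this section. As a rough illustration only, the sketch below follows the commonly used formulation: regions are removed (deletion) or restored (insertion) in order of decreasing importance, and the area under the resulting model-score curve is reported. Lower is better for deletion, higher for insertion. The names `removal_curve_auc`, `toy_score`, and `WEIGHTS` are hypothetical, and the toy scoring function stands in for a real video classifier.

```python
from typing import Callable, List, Sequence


def removal_curve_auc(
    score_fn: Callable[[List[bool]], float],
    importances: Sequence[float],
    mode: str = "deletion",
) -> float:
    """Area under the deletion or insertion curve (trapezoidal rule).

    score_fn receives a keep-mask (True = region visible) and returns the
    model score for the explained class. This is an assumed interface.
    """
    n = len(importances)
    if n == 0:
        raise ValueError("at least one region is required")
    if mode not in ("deletion", "insertion"):
        raise ValueError("mode must be 'deletion' or 'insertion'")

    # Most important regions are processed first.
    order = sorted(range(n), key=lambda i: importances[i], reverse=True)

    # Deletion starts from the full input, insertion from the empty one.
    start_visible = mode == "deletion"
    keep = [start_visible] * n
    scores = [score_fn(keep)]
    for i in order:
        keep[i] = not start_visible
        scores.append(score_fn(keep))

    # Trapezoidal rule over n equal steps on the interval [0, 1].
    return sum((scores[k] + scores[k + 1]) / 2 for k in range(n)) / n


# Toy stand-in for a classifier: each visible region adds a fixed amount.
WEIGHTS = [0.5, 0.3, 0.2, 0.0]


def toy_score(keep: List[bool]) -> float:
    return sum(w for w, visible in zip(WEIGHTS, keep) if visible)


faithful = WEIGHTS                  # explanation matching the toy model
misleading = [0.0, 0.2, 0.3, 0.5]   # explanation ranking regions backwards

deletion_faithful = removal_curve_auc(toy_score, faithful, "deletion")
deletion_misleading = removal_curve_auc(toy_score, misleading, "deletion")
insertion_faithful = removal_curve_auc(toy_score, faithful, "insertion")
```

In this toy setting the faithful ranking gives a deletion area of about 0.3 against about 0.7 for the misleading one, which is the sense in which a lower deletion value indicates a better explanation.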

Paper Structure (30 sections, 17 figures, 8 tables)

Figures (17)

  • Figure 1: Covert et al.'s framework [covert2021explaining] vs. REVEX, our extended framework. Our proposal begins and ends with segmentation and visualization, while decomposing feature removal into three components.
  • Figure 2: Video segmentation. From top to bottom, initial, middle, and final frames. From left to right, unmodified video, 3D grid, 2D grid segmentation extended with optical flow, and SLIC. Optical flow estimation was performed using PWC-Net [sun2018pwc].
  • Figure 3: Example image (left) and image segmented using Felzenszwalb's method, SLIC, quick shift, and compact watershed. Segmented regions are outlined in yellow and filled with the average color for each region.
  • Figure 4: Feature selection steps. Individual frames are displayed, although the regions span subsequent frames. Left to right, samples with a single feature removed, with all but one removed, and with about half the regions removed. For visualization, 3D grid-based segmentation and black fill were employed to remove regions.
  • Figure 5: Feature removal methods. As labeled, original input, occlusion with black, gray, average color, uniform blur filter, and up-scaled black. Only a single frame is displayed; however, segmentation uses a 3D grid (i.e., across frames). For clarity, only one region is occluded in each sample.
  • ...and 12 more figures
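
The captions for Figures 2, 4, and 5 describe 3D grid segmentation, three ways of selecting which regions to remove, and occlusion with a fill value. The following is a minimal sketch of those three steps on a tiny grayscale video stored as nested lists, not the authors' implementation. All function names, the cell sizes, and the 0.5 sampling probability are assumptions made for illustration.

```python
import random
from typing import List, Set

Video = List[List[List[float]]]  # indexed as video[t][y][x], grayscale


def grid_labels_3d(frames: int, height: int, width: int,
                   cell_t: int, cell_h: int, cell_w: int) -> List[List[List[int]]]:
    """Assign every (t, y, x) voxel to a cuboid region of the 3D grid."""
    n_h = -(-height // cell_h)  # ceiling division: cells per column
    n_w = -(-width // cell_w)   # ceiling division: cells per row
    return [[[(t // cell_t) * n_h * n_w + (y // cell_h) * n_w + (x // cell_w)
              for x in range(width)]
             for y in range(height)]
            for t in range(frames)]


def occlude(video: Video, labels: List[List[List[int]]],
            removed: Set[int], fill: float = 0.0) -> Video:
    """Return a copy of the video with removed regions set to `fill`.

    fill=0.0 corresponds to black occlusion; a gray or mean value could
    be passed instead.
    """
    return [[[fill if labels[t][y][x] in removed else video[t][y][x]
              for x in range(len(video[t][y]))]
             for y in range(len(video[t]))]
            for t in range(len(video))]


def single_removed(n_regions: int) -> List[Set[int]]:
    """One sample per region, with only that region removed."""
    return [{i} for i in range(n_regions)]


def all_but_one_removed(n_regions: int) -> List[Set[int]]:
    """One sample per region, with every other region removed."""
    everything = set(range(n_regions))
    return [everything - {i} for i in range(n_regions)]


def about_half_removed(n_regions: int, rng: random.Random) -> Set[int]:
    """One random sample in which each region is removed with probability 0.5."""
    return {i for i in range(n_regions) if rng.random() < 0.5}


# A 4-frame, 4x4 video of ones, split into 2x2x2 cuboids (8 regions).
video = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
labels = grid_labels_3d(4, 4, 4, cell_t=2, cell_h=2, cell_w=2)
n_regions = labels[-1][-1][-1] + 1  # the last voxel carries the largest label

occluded = occlude(video, labels, removed={0})
remaining = sum(v for frame in occluded for row in frame for v in row)
```

Because each region spans two frames here, removing region 0 blanks the same 2x2 patch in the first two frames, which mirrors the captions' note that regions extend across frames even though only single frames are shown.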