Challenges in explaining deep learning models for data with biological variation

Lenka Tětková; Erik Schou Dreier; Robin Malm; Lars Kai Hansen

TL;DR

The paper tackles the challenge of explaining deep learning decisions on biologically variable data by studying grain defect detection with a comprehensive evaluation of post-hoc heatmap methods. It introduces a robust workflow that jointly considers data robustness, explanation fidelity, and ground-truth alignment, and it demonstrates that no single method universally dominates across metrics. A distribution-aligned aggregation strategy and Monte Carlo-based ranking provide a practical, data-adaptive approach to selecting explainability methods, with LRP (EpsilonPlusFlat) often performing best for grain data. The findings highlight the importance of carefully documenting hyperparameters and evaluation choices, and they offer a framework applicable to other non-standard, biology-rich datasets where ground truth for explanations is ill-defined.

Abstract

Much progress in machine learning research is based on developing models and evaluating them on a benchmark dataset (e.g., ImageNet for images). However, applying such benchmark-successful methods to real-world data often does not work as expected. This is particularly the case for biological data, where we expect variability at multiple temporal and spatial scales. In this work, we use grain data, and the goal is to detect diseases and damage. Pink fusarium, skinned grains, and other diseases and damage are key factors in setting the price of grains or excluding dangerous grains from food production. Apart from challenges stemming from the differences between these data and standard toy datasets, we also present challenges that need to be overcome when explaining deep learning models. For example, explainability methods have many hyperparameters that can give different results, and the values published in the original papers do not work on dissimilar images. Other challenges are more general: explanations are hard to visualize and to compare because the magnitudes of their values differ from method to method. A fundamental open question is also how to evaluate explanations. This is a non-trivial task because the "ground truth" is usually missing or ill-defined. Moreover, human annotators may create what they think is an explanation of the task at hand, yet the machine learning model might solve it in a different and perhaps counter-intuitive way. We discuss several of these challenges and evaluate various post-hoc explainability methods on grain data. We focus on robustness, quality of explanations, and similarity to particular "ground truth" annotations made by experts. The goal is to find the methods that perform well overall and could be used in this challenging task. We hope the proposed pipeline will be used as a framework for evaluating explainability methods in specific use cases.
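
To make two of the challenges named in the abstract concrete (channel pooling, and magnitudes that differ from method to method), here is a minimal Python sketch. It is not the authors' pipeline: the function names `pool_channels` and `normalize_abs_max`, the array shapes, and the choice of scaling by the maximum absolute value are all assumptions made for illustration.

```python
import numpy as np

def pool_channels(attribution, mode="mean"):
    """Collapse a (C, H, W) attribution map into one (H, W) heatmap.

    Hypothetical helper: 'mean' and 'max' are two common pooling choices,
    and they can produce visibly different heatmaps.
    """
    if mode == "mean":
        return attribution.mean(axis=0)
    if mode == "max":
        return attribution.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

def normalize_abs_max(heatmap, eps=1e-12):
    """Scale a heatmap into [-1, 1] by its largest absolute value.

    Assumed normalization, chosen only so that heatmaps from methods
    with very different value ranges can be compared on one scale.
    """
    return heatmap / (np.abs(heatmap).max() + eps)

# Synthetic attributions standing in for two methods whose outputs
# carry the same pattern but differ in magnitude by a factor of 1000.
rng = np.random.default_rng(0)
attr_a = rng.normal(size=(3, 4, 4))
attr_b = 1000.0 * attr_a

heat_a = normalize_abs_max(pool_channels(attr_a, "mean"))
heat_b = normalize_abs_max(pool_channels(attr_b, "mean"))

# Mean and max pooling of the same attribution generally disagree.
heat_max = normalize_abs_max(pool_channels(attr_a, "max"))
```

After normalization, `heat_a` and `heat_b` coincide even though the raw values differ by three orders of magnitude, while `heat_max` shows that the pooling choice alone changes the resulting heatmap.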

Paper Structure (29 sections, 18 figures, 15 tables)

Introduction
Challenges
Challenges of evaluation
Ground truth
Hyperparameters
Channel pooling
Visualization
Aggregating multiple explanations
Methods
Data
Models
Explainability methods
Methods for evaluating the quality of explanations
Quality evaluation
Similarity to ground truth
...and 14 more sections

Figures (18)

Figure 1: Our paper explores challenges that one faces when explaining image classifiers, especially for the application on data with biological variation. We train a convolutional network for grain defect detection. We discuss some of the challenges and questions that arise in connection to applying post-hoc explanation methods on this model. For example, the choice of the explainability method is crucial. We evaluate the quality of explanations, perform an extensive analysis of some of these choices, and present the results, showing how big an impact each choice has.
Figure 2: Examples of images of grains with pink fusarium (left) and skinned (right) with human annotations.
Figure 3: LIME explanations of the same image with two different segmentations.
Figure 4: Heatmaps of three LRP methods used in this paper. These methods differ only in the propagation rules used for each type of layer. For details about the individual methods, see the "Explainability methods" section.
Figure 5: Mean (left) and max (right) pooling of Gradients explanation.
...and 13 more figures
