
A novel application of Shapley values for large multidimensional time-series data: Applying explainable AI to a DNA profile classification neural network

Lauren Elborough, Duncan Taylor, Melissa Humphries

TL;DR

This work solves the computational barrier of applying Shapley values to high-dimensional time-series data by adapting the image-based superpixel idea into a focusing occlusion framework. Using Kernel SHAP, the method iteratively partitions a DNA profile (6 dye lanes × 5200 points per lane, total $31{,}200$ points) into blocks and refines the most contributive regions, reducing the combinatorial burden from $2^{F}$ feature subsets (for $F$ input features) to a tractable sequence of occlusions. The approach yields per-scan-point Shapley values in about $1$ second and profiles in under an hour, with results aligning with expert domain knowledge and enabling defensible, explainable classifications in forensic DNA analysis. Beyond DNA profiling, the method provides a generalizable tool for explainability in vectorized time-series data across finance, biology, and other fields requiring transparent decision-making for high-dimensional inputs.

Abstract

The application of Shapley values to high-dimensional, time-series-like data is computationally challenging, and sometimes impossible. For $N$ inputs, exact computation requires evaluating on the order of $2^N$ feature subsets. In image processing, clusters of pixels, referred to as superpixels, are used to streamline computations. This research presents an efficient solution for time-series-like data that adapts the idea of superpixels for Shapley value computation. Motivated by a forensic DNA classification example, the method is applied to multivariate time-series-like data whose features have been classified by a convolutional neural network (CNN). In DNA processing, it is important to distinguish alleles from the background noise created by DNA extraction and processing. A single DNA profile has $31{,}200$ scan points to classify, and the classification decisions must be defensible in a court of law. This means that classification is routinely performed by human readers, a monumental and time-consuming process. The application of a CNN with fast computation of meaningful Shapley values provides a potential alternative to manual classification. This research demonstrates the realistic, accurate and fast computation of Shapley values for this massive task.
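
To give a sense of the scale involved, the short sketch below compares the number of feature subsets for a full profile with the number for a coarse block partition. The profile dimensions (6 dye lanes of 5200 scan points) come from the summary above; the choice of three blocks per dye lane is a hypothetical value used only for illustration, and `coalition_digits` is a helper name introduced here.

```python
import math

def coalition_digits(n_features: int) -> int:
    """Number of decimal digits in 2**n_features (the count of feature subsets)."""
    return math.floor(n_features * math.log10(2)) + 1

n_scan_points = 6 * 5200   # six dye lanes x 5200 scan points per lane
n_blocks = 6 * 3           # hypothetical: three blocks per dye lane

digits_full = coalition_digits(n_scan_points)  # size of 2**31200, in digits
subsets_blocked = 2 ** n_blocks                # subsets over blocks only

print(f"{n_scan_points} scan points: 2^N has {digits_full} decimal digits")
print(f"{n_blocks} blocks: {subsets_blocked} subsets")
```

With individual scan points as features, the subset count is a number thousands of digits long, whereas a few blocks per lane bring it down to a size that can be enumerated or sampled directly.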

Paper Structure (13 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: An example of a DNA profile for the GlobalFiler$^{TM}$ profiling system. The profile has six dye lanes, with relative fluorescence units (RFUs) on the y-axis measured over a sequence of base pairs on the x-axis. The profile contains both DNA and artefacts, the latter to be removed during the reading. The boxes above each dye lane represent the 24 different regions of DNA that are targeted in GlobalFiler$^{TM}$ during the PCR process.
  • Figure 2: An example of isolating features of importance in a DNA profile using the kernel SHAP algorithm, for predicting an allele in dye lane 3. The blue areas correspond to positive Shapley values, the red to negative Shapley values, and the grey areas represent negligible Shapley values. The strength of the colours varies according to the magnitude of the values.
  • Figure 3: An example of the third iteration of the kernel SHAP algorithm, for classifying a pull-up in the centre of dye lane 4. The blue areas represent positive Shapley values, and the red areas represent negative Shapley values. The remaining areas in the DNA profile had negligible values, indicating that they did not considerably influence the prediction.
  • Figure 4: An example of the Kernel SHAP algorithm results for classifying the centre scan point in dye lane 4 as a pull-up, after three focussing iterations. The graph is in the style of a DNA profile, where each row of nine boxplots represents one dye lane, and the x-axis displays the position of each block within its dye lane.
  • Figure 5: The workflow of the kernel SHAP iteration process. In the first step, the kernel SHAP algorithm is applied to each of the six dye lanes, highlighting the two blocks with the largest Shapley values, which are either positive (blue) or negative (red). These two regions are then each split into three regions, and the process continues until the desired granularity is reached.
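
The focusing workflow described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: it works on a single one-dimensional lane rather than six dye lanes, it replaces Kernel SHAP with exact Shapley enumeration over blocks (feasible only because the block count stays small), it occludes blocks by setting them to a constant baseline, and the names `block_shapley`, `split`, `focus` and `toy_model` are hypothetical.

```python
from itertools import combinations
from math import factorial
import numpy as np

def block_shapley(model, x, blocks, baseline=0.0):
    """Exact Shapley value per block; occluded blocks are set to `baseline`."""
    n = len(blocks)

    def value(kept):
        masked = np.full_like(x, baseline)
        for j in kept:
            start, stop = blocks[j]
            masked[start:stop] = x[start:stop]
        return model(masked)

    # Evaluate the model once for every coalition of blocks (2**n calls).
    cache = {frozenset(s): value(s)
             for r in range(n + 1) for s in combinations(range(n), r)}
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for s in combinations(others, r):
                s = frozenset(s)
                phi[i] += weight * (cache[s | {i}] - cache[s])
    return phi

def split(block, parts=3):
    """Split a (start, stop) interval into up to `parts` contiguous pieces."""
    start, stop = block
    edges = [start + (stop - start) * k // parts for k in range(parts + 1)]
    return [(a, b) for a, b in zip(edges[:-1], edges[1:]) if b > a]

def focus(model, x, n_iter=3, top_k=2, parts=3):
    """Repeatedly split the `top_k` blocks with the largest |Shapley value|."""
    blocks = split((0, len(x)), parts)
    for it in range(n_iter):
        phi = block_shapley(model, x, blocks)
        if it == n_iter - 1:
            break  # keep phi aligned with the final block list
        top = set(np.argsort(-np.abs(phi))[:top_k].tolist())
        refined = []
        for j, b in enumerate(blocks):
            refined.extend(split(b, parts) if j in top else [b])
        blocks = refined
    return blocks, phi

# Toy stand-in for the classifier: responds only to scan points 40..44.
x = np.ones(90)
toy_model = lambda v: float(v[40:45].sum())
blocks, phi = focus(toy_model, x)
for b, p in zip(blocks, phi):
    print(b, round(float(p), 3))
```

In this toy run the blocks narrow towards the region the model actually uses, and the block values sum to the difference between the model output on the full input and on the fully occluded input, as Shapley values should.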