Towards Gradient-based Time-Series Explanations through a SpatioTemporal Attention Network

Min Hun Lee

TL;DR

The paper addresses time-series explanation through gradient-based frame saliency. It trains a transformer-based SpatioTemporal Attention Network (STAN) on global and local views of video data for classification, then uses gradients to score individual frames. Gradient-based explanations (vanilla gradient, SmoothGrad) identify important frames with competitive accuracy on four medically relevant activities, especially for short sequences; for longer sequences, CNN-based approaches may perform better. The main contributions are (i) designing STAN with UniFormer blocks to fuse global and ROI-based local information, (ii) applying and evaluating multiple gradient-based XAI methods for time-series explanations, and (iii) providing empirical evidence on the feasibility and practicality of gradient-based time-series explanations in a healthcare-relevant setting. Overall, the work suggests that transformer-based global+local attention can support informative time-series explanations, with caveats about long-sequence scalability and the need for broader generalization.

Abstract

In this paper, we explore the feasibility of using a transformer-based spatiotemporal attention network (STAN) for gradient-based time-series explanations. First, we trained the STAN model for video classification using the global and local views of data and weakly supervised labels on time-series data (i.e., the type of an activity). We then leveraged a gradient-based XAI technique (e.g., a saliency map) to identify salient frames of time-series data. In experiments on datasets of four medically relevant activities, the STAN model demonstrated its potential to identify important frames of videos.
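The frame-scoring step described in the abstract can be illustrated with a minimal sketch. It is not the STAN model or the authors' code: `class_score` is a hypothetical toy classifier score whose gradient is written out analytically, and the per-frame aggregation (sum of absolute gradients) is one common choice assumed here. In practice an autograd framework would supply the gradient of the trained model's class score.

```python
def class_score(video, weights):
    """Toy class score (hypothetical stand-in for a trained model):
    0.5 * sum over frames t and features d of w_d * x_td^2."""
    return 0.5 * sum(w * x * x for frame in video for w, x in zip(weights, frame))


def input_gradient(video, weights):
    """Analytic gradient of class_score w.r.t. every input value: w_d * x_td."""
    return [[w * x for w, x in zip(weights, frame)] for frame in video]


def frame_saliency(video, weights):
    """Vanilla-gradient saliency per frame: sum of absolute input gradients."""
    return [sum(abs(g) for g in frame) for frame in input_gradient(video, weights)]


def rank_frames(saliency):
    """Frame indices sorted from most to least salient."""
    return sorted(range(len(saliency)), key=lambda t: saliency[t], reverse=True)


# A 4-frame "video" with 2 features per frame (made-up illustrative values).
video = [[0.1, 0.0], [0.2, -0.1], [1.5, -2.0], [0.3, 0.2]]
weights = [1.0, 2.0]

saliency = frame_saliency(video, weights)
ranking = rank_frames(saliency)
print(saliency)  # one score per frame
print(ranking)   # frame indices, most salient first
```

Only the video-level class label is needed to compute these scores, which is why the approach fits the weakly supervised setting: no frame-level annotation enters the gradient computation.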

Paper Structure (16 sections, 3 figures, 5 tables)


Figures (3)

  • Figure 1: Overall flow diagram of a SpatioTemporal Attention Network (STAN) for gradient-based time-series explanation. Our approach first learns a transformer-based spatiotemporal attention network for video classification using global and local views and weakly supervised labels. The STAN consists of four transformer-based attention stages. Given our STAN model, we compute the gradients of a video classification and leverage the gradient scores to identify important frames of a video.
  • Figure 2: Sample frames of the datasets: (a) Falling Down - Frontal View, (b) Headache - Frontal View, (c) Chest Pain - Side View, and (d) Rehabilitation Compensation - Frontal View.
  • Figure 3: (a) Original inputs, (b) Vanilla Gradient, and (c) Grad-CAM, using the STAN model with global+local views of data. Overall, the model with the vanilla gradient highlighted focused body areas relevant to falling down, while the model with Grad-CAM tended to have more diffuse attention, including some attention on background areas.
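The last step in Figure 1, turning gradient scores into important frames, can also be done with SmoothGrad, which averages gradients over noisy copies of the input. The sketch below is a toy illustration under stated assumptions, not the paper's pipeline: the score function (a weighted sum of `tanh` features), the noise level `sigma`, the sample count, and the top-k selection rule are all hypothetical choices made for the example.

```python
import math
import random


def input_gradient(video, weights):
    """Gradient of the toy score sum_t sum_d w_d * tanh(x_td) w.r.t. each input."""
    return [[w * (1.0 - math.tanh(x) ** 2) for w, x in zip(weights, frame)]
            for frame in video]


def smoothgrad(video, weights, n_samples=50, sigma=0.2, seed=0):
    """Average the input gradient over Gaussian-perturbed copies of the video."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total = [[0.0] * len(frame) for frame in video]
    for _ in range(n_samples):
        noisy = [[x + rng.gauss(0.0, sigma) for x in frame] for frame in video]
        grads = input_gradient(noisy, weights)
        for t, frame_grads in enumerate(grads):
            for d, g in enumerate(frame_grads):
                total[t][d] += g / n_samples
    return total


def top_k_frames(grads, k):
    """Indices of the k frames with the largest summed |gradient|, in time order."""
    scores = [sum(abs(g) for g in frame) for frame in grads]
    best = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]
    return sorted(best)


# A 4-frame "video" with 2 features per frame (made-up illustrative values).
video = [[3.0, 3.0], [0.0, 0.1], [-3.0, 2.5], [0.2, -0.1]]
weights = [1.0, 2.0]

smooth = smoothgrad(video, weights)
important = top_k_frames(smooth, k=2)
print(important)  # indices of the two highest-scoring frames
```

Because the toy score saturates for large inputs, frames whose features sit near zero receive the largest gradients here. With `sigma=0.0` the function reduces to the plain (vanilla) gradient, which makes the two methods easy to compare on the same input.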