Towards Gradient-based Time-Series Explanations through a SpatioTemporal Attention Network
Min Hun Lee
TL;DR
The paper tackles explaining time-series data via gradient-based frame saliency, using a transformer-based SpatioTemporal Attention Network (STAN) trained on global and local views for video classification. It reports that gradient-based explanations (vanilla gradient, SmoothGrad) can identify important frames with competitive accuracy on four medically relevant activities, especially for short sequences, and it notes a limitation for longer sequences, where CNN-based approaches may perform better. The main contributions are (i) designing STAN with UniFormer blocks to fuse global and ROI-based local information, (ii) applying and evaluating multiple gradient-based XAI methods for time-series explanations, and (iii) providing empirical evidence on the feasibility and practicality of gradient-based time-series explanations in a healthcare-relevant setting. Overall, the work suggests that a transformer combining global and local attention can support informative time-series explanations, with caveats about long-sequence scalability and the need for broader generalization.
Abstract
In this paper, we explore the feasibility of using a transformer-based spatiotemporal attention network (STAN) for gradient-based time-series explanations. First, we trained the STAN model for video classification using the global and local views of the data and weakly supervised labels on the time-series data (i.e., the type of activity). We then leveraged a gradient-based XAI technique (e.g., a saliency map) to identify salient frames of the time-series data. In experiments on datasets of four medically relevant activities, the STAN model demonstrated its potential to identify important frames of videos.
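To make the idea of gradient-based frame saliency concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy differentiable scorer (`toy_score`, hypothetical) in place of the trained STAN model, approximates input gradients with central finite differences instead of backpropagation, and aggregates each frame's gradient by its L1 norm. All function names, the weights, the noise level, and the sample count are illustrative assumptions.

```python
import math
import random


def toy_score(frames, weights=(1.0, -0.5)):
    """Hypothetical class score: sum over frames of tanh(w . x_t).

    Stands in for the class logit of a trained video classifier.
    """
    return sum(
        math.tanh(sum(w * x for w, x in zip(weights, frame)))
        for frame in frames
    )


def input_gradient(score_fn, frames, eps=1e-5):
    """Approximate d(score)/d(input) with central finite differences.

    A real network would obtain this gradient by backpropagation.
    """
    grads = []
    for t, frame in enumerate(frames):
        row = []
        for d in range(len(frame)):
            plus = [list(f) for f in frames]
            minus = [list(f) for f in frames]
            plus[t][d] += eps
            minus[t][d] -= eps
            row.append((score_fn(plus) - score_fn(minus)) / (2 * eps))
        grads.append(row)
    return grads


def frame_saliency(grads):
    """Collapse each frame's gradient into one score (L1 norm)."""
    return [sum(abs(g) for g in row) for row in grads]


def vanilla_gradient(score_fn, frames):
    """Vanilla gradient saliency: one gradient at the original input."""
    return frame_saliency(input_gradient(score_fn, frames))


def smoothgrad(score_fn, frames, noise_std=0.1, n_samples=20, seed=0):
    """SmoothGrad: average the gradients of Gaussian-perturbed inputs."""
    rng = random.Random(seed)
    n_frames, n_features = len(frames), len(frames[0])
    mean_grad = [[0.0] * n_features for _ in range(n_frames)]
    for _ in range(n_samples):
        noisy = [[x + rng.gauss(0.0, noise_std) for x in f] for f in frames]
        grad = input_gradient(score_fn, noisy)
        for t in range(n_frames):
            for d in range(n_features):
                mean_grad[t][d] += grad[t][d] / n_samples
    return frame_saliency(mean_grad)


# Three toy "frames" with two features each (illustrative values only).
frames = [[0.0, 0.0], [3.0, 3.0], [2.0, 2.0]]
vanilla = vanilla_gradient(toy_score, frames)
smooth = smoothgrad(toy_score, frames)
most_salient = max(range(len(frames)), key=lambda t: vanilla[t])
print("vanilla:", vanilla)
print("smoothgrad:", smooth)
print("most salient frame index:", most_salient)
```

In this toy setup the first frame ranks highest because the score is most sensitive to changes there, while the other frames sit in the saturated region of `tanh`. With an actual network, the finite-difference step would be replaced by an automatic-differentiation call, and the per-frame aggregation (L1 norm here) is one of several reasonable choices.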
