Interpretable Pre-Trained Transformers for Heart Time-Series Data

Harry J. Davies; James Monsen; Danilo P. Mandic

TL;DR

The paper tackles the challenge of interpretable analysis of clinical heart time-series by introducing two decoder-only transformer models, PPG-PT and ECG-PT, trained to predict the next token in PPG and ECG sequences. It demonstrates interpretability through aggregate attention, phase-based token clustering, and head-level feature maps, while achieving strong fine-tuning performance on atrial fibrillation detection and PPG beat detection. The models are pre-trained on large, diverse cardiac datasets and show robust generalization to unseen morphologies, with AF AUCs up to 0.99 (ECG-PT) and 0.93 (PPG-PT) and beat-detection F1 around 98%. The work advances explainable AI in healthcare by providing transparent reasoning pathways alongside accurate clinical predictions and a practical fine-tuning workflow.

Abstract

Decoder-only transformers are the backbone of the popular generative pre-trained transformer (GPT) series of large language models. In this work, we apply this framework to the analysis of clinical heart time-series data, to create two pre-trained general purpose cardiac models, termed PPG-PT and ECG-PT. We place a special emphasis on making both pre-trained models fully interpretable. This is achieved firstly through aggregate attention maps which show that, in order to make predictions, the model focuses on similar points in previous cardiac cycles and gradually broadens its attention in deeper layers. Next, we show that tokens with the same value, which occur at distinct points in the electrocardiography (ECG) and photoplethysmography (PPG) cycle, form separate clusters in high dimensional space. The clusters form according to phase, as the tokens propagate through the transformer blocks. Finally, we highlight that individual attention heads respond to specific physiologically relevant features, such as the dicrotic notch in PPG and the P-wave in ECG. It is also demonstrated that these pre-trained models are straightforward to fine-tune for tasks such as classification of atrial fibrillation (AF), and beat detection in photoplethysmography. For the example of AF, the fine-tuning took 11 minutes of computer time, and achieved the respective leave-one-subject-out AUCs of 0.99 and 0.93 for ECG and PPG within the MIMIC Perform AF dataset. In addition, the fine-tuned beat detector achieved a state-of-the-art F1 score of 98%, as well as uniquely providing a beat confidence level which acts as a signal quality estimator. Importantly, the fine-tuned models for AF screening are also fully explainable, with attention shifting to regions in the context that are strongly indicative of atrial fibrillation.
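
As a rough illustration of the masked self-attention that a decoder-only transformer relies on for next-token prediction, the sketch below implements a single attention head in NumPy. It is not the authors' implementation: the function name causal_self_attention, the random projection matrices and the sizes T, d and d_head are all hypothetical and chosen only to make the example run.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head masked self-attention over a sequence of token embeddings.

    x: (T, d) embeddings; w_q, w_k, w_v: (d, d_head) projection matrices.
    Returns the (T, d_head) outputs and the (T, T) attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask future positions so token t can only attend to tokens 0..t.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical sizes and random inputs, not taken from the paper.
rng = np.random.default_rng(0)
T, d, d_head = 6, 8, 4
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d_head)) for _ in range(3))
out, attn = causal_self_attention(x, w_q, w_k, w_v)
```

Each row of attn sums to one and is zero above the diagonal, so the last row shows how the prediction point distributes its attention over the preceding context only.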

Paper Structure (19 sections, 1 equation, 6 figures, 1 table)

Introduction
The Pre-Trained Transformer Models
Tokenisation and Embedding
Multi-Head Masked Self Attention and the Transformer Block
Architecture
Pre-Training
Fine-Tuning
Atrial Fibrillation
PPG Beat Detection
Cross Entropy Loss and Next Token Prediction vs Mean Squared Error
Generative Model Evaluation
Interpretability of the Pre-Trained Models
Interpretability of aggregate model attention
Vector similarities between points of interest
Attention Maps of Individual Attention Heads
...and 4 more sections

Figures (6)

Figure 1: Generation error (in terms of absolute error) for both the PPG-PT and ECG-PT base models for a duration of half the length of the context window, evaluated over 40 unseen subjects from the bed-based ballistocardiography dataset. The examples demonstrate that whilst the models were able to accurately capture features of the context, such as pulse shape and the dicrotic notch in the case of PPG, and all sections of the P-QRS-T for the ECG, large errors can come from slight temporal misalignment. (a) Two examples of model generations for PPG-PT, with context in black, ground truth in blue and model prediction in red. Both examples provide the maximum error of the prediction, with a large error occurring in the first, and a more typical error occurring in the second example. (b) The median prediction error (red solid line) and interquartile range (red shaded area) for PPG-PT across all windows. (c) Two examples of model generations for ECG-PT, with context in black, ground truth in blue and model prediction in red. Both examples provide the maximum absolute error of the prediction, with two examples of large errors due to slight misalignments in prediction. (d) The median prediction error (red solid line) and interquartile range (red shaded area) for ECG-PT across all windows.
Figure 2: Aggregate attention across all attention heads for the attention row corresponding to the prediction point, per transformer layer shown (the first and last layers), for both the PPG-PT and ECG-PT models. The attention maps demonstrate that in order to predict the next token, the models first look at all tokens in the immediate cycle to gain an understanding of where a token falls within that cycle. When this understanding has been established, the models then look at all similar points occurring in other cycles in the context window. For both models, attention in the first transformer layer is shown on top and the last transformer layer beneath, with context (black line) and the prediction point (blue circle) corresponding to overlaid transformer attention (red transparent bars), with transparency scaled based on the attention weights shown below (red solid line). (a) The PPG-PT aggregate attention, for the prediction of a peak in the given previously unseen photoplethysmography context. (b) The ECG-PT aggregate attention, for the prediction of a peak in the given previously unseen electrocardiography context.
Figure 3: Plots showing the cosine similarity of points in rising slopes (blue circles, blue solid lines) and falling slopes (red circles, red solid lines) with similar input values, upon propagation through the transformer layers of the model. In each case, a single point on a rising slope is chosen as a comparison point, represented by a black cross on the context (black solid line), and a black dotted line with a cosine similarity of 1 in the cosine similarity plot. The point of propagation through the model is highlighted with grey transparent blocks labeled with the corresponding transformer layer (from 1 to 8). In both examples, rising slope points and falling slope points are shuffled in embedding space upon input to the first model layer, and gradually divide into two clusters as they propagate through the model. (a) The cosine similarity in embedding space of rising slopes and falling slopes in a photoplethysmography signal, upon propagation through the PPG-PT model. (b) The cosine similarity in embedding space of rising slopes and falling slopes in the T-wave of an electrocardiography signal, upon propagation through the ECG-PT model.
Figure 4: Attention maps of the attention weights of individual attention heads of the last layer of both the PPG-PT and ECG-PT models. Averaged attention maps are generated by averaging the final layer of the attention weight matrix over the prediction of the next 500 tokens beyond the initial context window. The model context is plotted in black, with attention weights plotted in red, pink, and blue for different attention heads. Peaks in attention are highlighted in the same colour with circles, and overlaid on the context with bars. (a) Mapping of the averaged attention weights in the final layer attention heads of the PPG-PT model for an example photoplethysmography context. Observe that the 3rd attention head (red) looks primarily for peaks in PPG, the 4th attention head (pink) looks primarily for troughs in PPG, and the 6th (blue) looks for the dicrotic notch. (b) Mapping of the averaged attention weights in the final layer attention heads of the ECG-PT model for an example electrocardiography context. The 1st attention head (red) looks for R peaks in the ECG, the 5th attention head (pink) looks primarily for P waves, and the 7th attention head (blue) looks primarily for the Q portion of the QRS complex of ECG.
Figure 5: Changes in final layer attention weights of the base PPG-PT and ECG-PT models when fine-tuned to classify atrial fibrillation. The model input context is shown in black, with attention overlaid with red bars of varying transparency based on the change in attention weights. The full difference in attention weights is shown below each plot in red. (a) Example fine-tuned PPG-PT test results, with examples of pulses occurring later than expected and earlier than expected, and the corresponding spikes in attention weights. (b) Example fine-tuned ECG-PT test results, again showing beats occurring later or earlier than expected with the aforementioned spikes in attention.
...and 1 more figures
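
The quantities plotted in Figures 2 and 3 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the captions do not say how attention is aggregated across heads, so a plain mean is assumed here, and the function names, array layouts and random demo data are hypothetical rather than taken from the paper. Only the count of 8 transformer layers comes from the Figure 3 caption.

```python
import numpy as np

def aggregate_prediction_attention(attn):
    """Average the prediction-point attention row over heads (assumed mean).

    attn: (n_heads, T, T) attention weights of one layer, rows summing to 1.
    Returns a length-T vector of attention over the context.
    """
    return attn[:, -1, :].mean(axis=0)

def cosine_to_reference(hidden, ref_idx):
    """Cosine similarity of every position to a reference position, per layer.

    hidden: (n_layers, T, d) hidden states after each transformer layer.
    Returns an (n_layers, T) array; column ref_idx is 1 in every layer.
    """
    unit = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    return np.einsum("ltd,ld->lt", unit, unit[:, ref_idx, :])

# Random stand-ins for real model outputs; sizes are hypothetical.
rng = np.random.default_rng(0)
n_heads, n_layers, T, d = 4, 8, 10, 16
logits = rng.normal(size=(n_heads, T, T))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
agg = aggregate_prediction_attention(attn)
hidden = rng.normal(size=(n_layers, T, d))
sims = cosine_to_reference(hidden, ref_idx=3)
```

With real model outputs, agg would correspond to the red attention trace under each context in Figure 2, and each row of sims to one layer's similarity values relative to the chosen comparison point in Figure 3.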