Self Attention with Temporal Prior: Can We Learn More from Arrow of Time?

Kyung Geun Kim; Byeong Tak Lee

TL;DR

The paper tackles learning short-term temporal biases in time series by injecting a temporal prior into Transformer attention through kernelized attention. It proposes the SAT-Transformer, which uses exponential and periodic kernels to bias attention toward nearby timestamps while remaining computationally efficient. With kernel matrices $\mathbf{C}^{(e)}$ and $\mathbf{C}^{(p)}$, the attention matrix becomes $\widehat{\mathbf{A}} = \mathrm{softmax}\left( (\mathbf{C}^{(e)} \odot \mathbf{Q})(\mathbf{C}^{(p)} \odot \mathbf{K})^{\top} / \sqrt{d_k} \right)$, where $\odot$ denotes the element-wise product. The approach is validated on three open EHR datasets (PhysioNet, MIMIC-III, eICU), showing consistent improvements in AUPRC and AUROC over strong baselines, especially in data-limited scenarios, and it demonstrates that temporal kernels can yield more diverse and informative attention patterns. The results suggest that leveraging the inherent temporal structure of time-series data can reduce data requirements and improve predictive performance in clinical tasks, with potential applicability beyond healthcare time series. Overall, the SAT-Transformer provides a practical, principled way to encode temporal priors in attention, delivering gains with modest computational overhead.
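The attention update quoted in the TL;DR can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kernelized_attention`, `hadamard`), the toy inputs, and the way `C_p` is filled are all hypothetical, and the kernel matrices are simply taken as given inputs with the same shape as $\mathbf{Q}$ and $\mathbf{K}$.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def hadamard(A, B):
    """Element-wise product of two equally shaped matrices (lists of rows)."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def kernelized_attention(Q, K, C_e, C_p):
    """Compute softmax((C_e * Q)(C_p * K)^T / sqrt(d_k)), one row at a time."""
    d_k = len(Q[0])
    Qw = hadamard(C_e, Q)  # weighted queries
    Kw = hadamard(C_p, K)  # weighted keys
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in Kw]
        for q_row in Qw
    ]
    return [softmax(r) for r in scores]

# Toy inputs: 3 time steps, d_k = 2 (values chosen only for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.5], [0.5, 1.0], [0.0, 1.0]]

# With all-ones kernel matrices the formula reduces to standard scaled
# dot-product attention.
ones = [[1.0, 1.0] for _ in range(3)]
A_vanilla = kernelized_attention(Q, K, ones, ones)

# Hypothetical key-side weighting (an assumption, not the paper's kernel):
# each key row is scaled by exp(-decay * t), with t the time index.
decay = 0.5
C_p = [[math.exp(-decay * t)] * 2 for t in range(3)]
A_hat = kernelized_attention(Q, K, ones, C_p)
```

Because the kernel matrices act on the queries and keys before the dot product, the attention weights stay a proper softmax distribution in each row; only the scores that feed the softmax change.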

Abstract

Many phenomena in nature inherently encode both short- and long-term temporal dependencies, which result in particular from the direction of the flow of time. In this respect, we found experimental evidence suggesting that events are more strongly interrelated when their time stamps are closer. However, attention-based models need large amounts of data to learn these regularities in short-term dependencies, and such data are often infeasible to obtain. This is because, while attention-based models are good at learning piece-wise temporal dependencies, they lack structures that encode biases in time series. As a resolution, we propose a simple and efficient method that enables attention layers to better encode the short-term temporal bias of these data sets by applying learnable, adaptive kernels directly to the attention matrices. We chose various prediction tasks on Electronic Health Records (EHR) data sets for the experiments, since these data exhibit both long- and short-term temporal dependencies. Our experiments show exceptional classification results compared to the best-performing models on most tasks and data sets.
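To make the idea of a short-term temporal bias concrete, the sketch below applies a distance-decaying kernel element-wise to an attention matrix and renormalizes each row. It is an illustration under assumptions rather than the paper's method: the kernel form $\exp(-\lambda |i - j|)$, the helper names, and the fixed value of `lam` are hypothetical (the abstract describes the kernels as learnable, whereas here the decay rate is a constant).

```python
import math

def exponential_kernel(n, lam):
    """n x n matrix with entries exp(-lam * |i - j|): 1 on the diagonal,
    decaying as the two time indices move apart."""
    return [[math.exp(-lam * abs(i - j)) for j in range(n)] for i in range(n)]

def apply_kernel(attn, kernel):
    """Multiply attention weights by the kernel element-wise, then
    renormalize each row so it sums to 1 again."""
    out = []
    for a_row, k_row in zip(attn, kernel):
        weighted = [a * k for a, k in zip(a_row, k_row)]
        total = sum(weighted)
        out.append([w / total for w in weighted])
    return out

# Start from uniform attention over 4 time steps (an uninformative baseline).
n = 4
uniform = [[1.0 / n] * n for _ in range(n)]

# lam is a hypothetical fixed decay rate; a learnable version would treat it
# as a trainable parameter.
biased = apply_kernel(uniform, exponential_kernel(n, lam=1.0))
```

After the kernel is applied, each time step places the most weight on itself and progressively less on more distant steps, which is the kind of closeness prior the abstract argues is hard to learn from limited data.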

Paper Structure (17 sections, 20 equations, 4 figures, 6 tables)

Introduction
Related Works
Structural Bias of Model and Generalization
Event Prediction in Medicine Using EHR
Proposed Method
Learning Underlying Structure of a Time Series
Architecture of the Attention Matrix
Experiments
Data Sets
Models and Settings
Results
Ablation Study and Extensions
Discussion
Conclusion
Appendix
...and 2 more sections

Figures (4)

Figure 1: Attention matrices of a single-layer vanilla Transformer. Panels (a), (b), (c), and (d) show the attention matrices when trained on 1%, 10%, 50%, and 100% of the PhysioNet data set, respectively.
Figure 2: Performance of each model with respect to reduction of data set size.
Figure 3: Attention matrices from the SAT-Transformer and the vanilla Transformer for all three layers in two different heads. (1), (2), and (3) denote layer 1, layer 2, and layer 3, respectively. (a) denotes the vanilla Transformer, and (b) denotes the SAT-Transformer. The left and right figures in each group show two different heads.
Figure 4: Representative behavior of learned kernels for all three layers of SAT-Transformer. The different colors represent the behaviors of the kernels of different attention heads.
