Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series

Guoqi Yu; Juncheng Wang; Chen Yang; Jing Qin; Angelica I. Aviles-Rivero; Shujun Wang

TL;DR

CoTAR (Core Token Aggregation-Redistribution), a centralized MLP-based module designed to replace decentralized attention, introduces a global core token that serves as a proxy to facilitate inter-token interactions, thereby enforcing a centralized aggregation and redistribution strategy.

Abstract

Accurate analysis of medical time series (MedTS) data, such as electroencephalography (EEG) and electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibit two critical patterns: temporal dependencies within individual channels and channel dependencies across multiple channels. While recent advances in deep learning have leveraged Transformer-based models to effectively capture temporal dependencies, they often struggle with modeling channel dependencies. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention mechanism is decentralized, making it less effective at capturing global synchronization and unified waveform patterns. To address this mismatch, we propose CoTAR (Core Token Aggregation-Redistribution), a centralized MLP-based module designed to replace decentralized attention. Instead of allowing all tokens to interact directly, as in standard attention, CoTAR introduces a global core token that serves as a proxy to facilitate inter-token interactions, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a 12.13% improvement on the APAVA dataset, while using only 33% of the memory and 20% of the inference time compared to the previous state of the art. Code and all training scripts are available at https://github.com/Levi-Ackman/TeCh.
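The aggregate-then-redistribute idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cotar_sketch`, the softmax-weighted pooling used for aggregation, the concatenate-and-project step used for redistribution, and the weight matrices `w_score` and `w_fuse` are all hypothetical choices made here. The abstract only specifies that a single core token mediates all inter-token interaction.

```python
import numpy as np


def cotar_sketch(tokens, w_score, w_fuse):
    """Illustrative core-token aggregation and redistribution.

    tokens:  (N, D) array, one row per token (e.g. one per channel).
    w_score: (D, D) hypothetical scoring weights for aggregation.
    w_fuse:  (2 * D, D) hypothetical fusion weights for redistribution.
    Returns the updated tokens (N, D) and the core token (D,).
    """
    # Aggregation: score every token, then softmax over the token axis
    # so that each feature of the core is a convex mix of the tokens.
    scores = tokens @ w_score
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    core = (weights * tokens).sum(axis=0)  # single core token, shape (D,)

    # Redistribution: attach the same core token to every token and
    # fuse the pair with one linear map. Tokens never interact directly.
    core_tiled = np.broadcast_to(core, tokens.shape)
    fused = np.concatenate([tokens, core_tiled], axis=1) @ w_fuse
    return fused, core


# Example call with random inputs (19 tokens of width 8, chosen arbitrarily).
rng = np.random.default_rng(0)
example_tokens = rng.normal(size=(19, 8))
example_out, example_core = cotar_sketch(
    example_tokens, rng.normal(size=(8, 8)), rng.normal(size=(16, 8))
)
```

Each step in this sketch visits every token once, so the cost grows linearly with the number of tokens N and no N x N interaction matrix is ever formed, which is the contrast with standard attention that the abstract draws.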

Paper Structure

This paper contains 32 sections, 6 equations, 5 figures, 12 tables, and 1 algorithm.

Introduction
Related Work
Preliminaries
Method
Attention vs Core Token Aggregation-Redistribution
Overview of TeCh
Experiments
Experiment Setting
Main Result
Ablation Study
Model Efficiency Analysis
Robustness Analysis
Ablation Study on 'Adaptive Dual Tokenization'.
Ablation Study on 'Core Token Aggregate-Redistribute'.
Visualization of Core Token
...and 17 more sections

Figures (5)

Figure 1: (a): Illustration of temporal dependencies within each channel and channel dependencies across channels. (b): Interaction between channels in EEG/ECG signals is centrally controlled by the brain/heart. (c): The attention module is a decentralized structure in which each token attends to all other tokens equally. (d): The proposed Core Token Aggregation-Redistribution (CoTAR) module operates in a centralized manner, with a core token acting as a proxy.
Figure 2: Illustration of attention and Core Token Aggregation-Redistribution (CoTAR). Attention is organized in a decentralized way: each token interacts directly with all tokens, which incurs quadratic complexity. CoTAR first aggregates the tokens into a core token and then redistributes it across channels to facilitate centralized channel interaction, at only linear complexity.
Figure 3: Overview of TeCh. MedTS signals $X \in \mathbb{R}^{T\times C}$ are embedded into a Temporal embedding and a Channel embedding. Each embedding is then processed by Transformer encoders in which attention is replaced by CoTAR. The final output representations of the two branches are each averaged across channels, summed, and then projected to the final predicted logits $\hat{Y} \in \mathbb{R}^{K}$.
Figure 4: (a): Efficiency and effectiveness analysis of TeCh and other baselines on the APAVA dataset with batch size $B = 128$. '#' abbreviates 'former' to save space. (b): Robustness of attention and CoTAR to noise when using the Channel or Temporal embedding. We progressively increase the intensity $\beta$ (the standard deviation) of Gaussian random noise from $0.0$ to $20.0$ on the last channel of the PTB dataset. F1-Score is used to quantify the change.
Figure 5: t-SNE visualization of the core token generated by CoTAR alongside the other tokens. We visualize both the temporal and the channel embedding spaces.
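The noise perturbation described in the caption of Figure 4(b) can be sketched as below. This is a hedged illustration only: the helper name `add_noise_to_last_channel`, the assumption that the noise is zero-mean and additive, the assumption that it is drawn independently per time step, and the `seed` parameter are choices made here, not details confirmed by the captions. The input layout $(T, C)$ follows the shape of $X$ given for Figure 3.

```python
import numpy as np


def add_noise_to_last_channel(x, beta, seed=0):
    """Return a copy of x with Gaussian noise on its last channel.

    x:    (T, C) array, T time steps by C channels.
    beta: noise standard deviation (0.0 leaves the signal unchanged).
    seed: hypothetical parameter, used here only for reproducibility.
    """
    rng = np.random.default_rng(seed)
    noisy = np.array(x, dtype=float, copy=True)  # never modify the input
    # Assumed form of the perturbation: zero-mean additive noise,
    # one independent draw per time step, last channel only.
    noisy[:, -1] += rng.normal(loc=0.0, scale=beta, size=noisy.shape[0])
    return noisy


# Sweep the intensity over the range named in the caption (0.0 to 20.0);
# the step of 5.0 and the toy signal shape are arbitrary choices.
signal = np.zeros((300, 15))
sweep = {beta: add_noise_to_last_channel(signal, beta) for beta in np.arange(0.0, 20.1, 5.0)}
```

In a robustness study of this kind, each perturbed copy would be passed to a trained model and the resulting F1-Score compared against the clean baseline.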
