Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation

Hongbo Zheng, Afshin Bozorgpour, Dorit Merhof, Minjia Zhang

TL;DR

PVT-GDLA is a decoder-centric Transformer that restores sharp, long-range dependencies in linear time, offering a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.

Abstract

Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies at linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
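
The $\mathcal{O}(N)$ claim rests on the standard reordering used in kernelized linear attention. As a brief sketch (the symbols $d$ for the per-head channel width and $\boldsymbol{1}_N$ for the all-ones vector are notation assumed here, not taken from the paper), associativity lets the $N \times N$ attention matrix be skipped entirely:

```latex
\underbrace{\bigl(\phi(\boldsymbol{Q})\,\phi(\boldsymbol{K})^{\intercal}\bigr)\boldsymbol{V}}_{\mathcal{O}(N^{2}d)}
\;=\;
\underbrace{\phi(\boldsymbol{Q})\,\bigl(\phi(\boldsymbol{K})^{\intercal}\boldsymbol{V}\bigr)}_{\mathcal{O}(Nd^{2})},
\qquad
\boldsymbol{z} = \phi(\boldsymbol{K})^{\intercal}\boldsymbol{1}_{N},
\qquad
\boldsymbol{A} = \frac{\phi(\boldsymbol{Q})\,\bigl(\phi(\boldsymbol{K})^{\intercal}\boldsymbol{V}\bigr)}{\phi(\boldsymbol{Q})\,\boldsymbol{z}}
```

Here the division is applied row-wise, so each output token is a normalized weighted average of the values. For fixed $d$, the cost grows linearly with the number of tokens $N$.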

Paper Structure

This paper contains 60 sections, 34 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Comparison of model performance on the Synapse dataset and the compute-accuracy trade-off. Our approach achieves the highest Dice score with fewer parameters and fewer FLOPs.
  • Figure 2: PVT-GDLA overview. A pretrained Pyramid Vision Transformer (PVT) encoder produces multi-scale features that feed a decoder built from GDLA blocks, each of which incorporates a GDLA mixer and an FFN. In each GDLA mixer, queries/keys are split into complementary subspaces; each branch performs gated differential kernelized linear attention. A parallel local token-mixing branch (a depthwise $3{\times}3$ convolution followed by a $1{\times}1$ convolution) reinforces neighborhood interactions; fused outputs are projected and upsampled with skip connections to recover spatial resolution. Positional embedding is replaced with a $3\times3$ depthwise convolution (DWC). Deep supervision [pmlr-v38-lee15a] is employed to speed up convergence.
  • Figure 3: Gated Differential Linear Attention (GDLA). Two complementary query/key subspaces $\boldsymbol{Q}_{1}$, $\boldsymbol{K}_{1}$ and $\boldsymbol{Q}_{2}$, $\boldsymbol{K}_{2}$ are formed from $\boldsymbol{X}$. Each branch computes kernelized linear attention in $\mathcal{O}(N)$ by first contracting $\phi(\boldsymbol{K}_i)^{\intercal}\boldsymbol{V}$, then mixing with $\phi(\boldsymbol{Q}_i)$ and normalizing by $\boldsymbol{z}_{i}$. We use $\phi(\cdot) = \mathrm{ELU}(\cdot)+1$ as the kernel function to ensure non-negative attention scores. The outputs are combined via a learnable, channel-wise subtraction $\boldsymbol{A}_1-\boldsymbol{A}_2\odot\boldsymbol{\lambda}$, stabilized with $\mathrm{RMSNorm}$, and modulated by a data-dependent $\mathrm{Sigmoid}$ gate $\boldsymbol{G}_i=\sigma(\boldsymbol{X}\boldsymbol{W}^{G}_{i})$.
  • Figure 4: Qualitative results of the proposed method versus other approaches on the Synapse dataset.
  • Figure 5: Visual comparison of the proposed method versus others on the skin datasets: PH$^{2}$ (top) and HAM10000 (bottom). The blue and green lines represent the prediction and the ground truth, respectively.
  • ...and 4 more figures
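
The GDLA computation described in the Figure 3 caption can be sketched for a single head in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the complementary subspaces are assumed to be the two channel halves of one projected query/key, RMSNorm is used without a learnable scale, the normalizer $\boldsymbol{z}_i$ is taken to be the sum of $\phi(\boldsymbol{K}_i)$ over tokens, and the names `gdla_head`, `w_q`, `w_k`, `w_v`, `w_g`, and `lam` are hypothetical. The local depthwise-convolution branch and the output projection are omitted.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = ELU(x) + 1: x + 1 for x > 0, exp(x) otherwise (strictly positive).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (N, d_k); v: (N, d_v). No N x N matrix is ever formed.
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    kv = phi_k.T @ v                 # contract keys with values first: (d_k, d_v)
    z = phi_k.sum(axis=0)            # normalizer vector (assumed form): (d_k,)
    numerator = phi_q @ kv           # (N, d_v)
    denominator = phi_q @ z + eps    # (N,)
    return numerator / denominator[:, None]

def rms_norm(x, eps=1e-6):
    # RMSNorm over channels, without a learnable scale (assumption).
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gdla_head(x, w_q, w_k, w_v, w_g, lam):
    # x: (N, d_model); lam: (d_v,) channel-wise subtraction scale.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Assumed split: the two channel halves act as the complementary subspaces.
    q1, q2 = np.split(q, 2, axis=-1)
    k1, k2 = np.split(k, 2, axis=-1)
    a1 = linear_attention(q1, k1, v)
    a2 = linear_attention(q2, k2, v)
    diff = rms_norm(a1 - a2 * lam)   # differential combination, then RMSNorm
    gate = sigmoid(x @ w_g)          # data-dependent gate G = sigma(X W^G)
    return gate * diff

# Toy usage with random weights (all sizes are illustrative).
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 16, 12, 8
x = rng.standard_normal((n_tokens, d_model))
w_q = 0.1 * rng.standard_normal((d_model, d_head))
w_k = 0.1 * rng.standard_normal((d_model, d_head))
w_v = 0.1 * rng.standard_normal((d_model, d_head))
w_g = 0.1 * rng.standard_normal((d_model, d_head))
lam = np.full(d_head, 0.5)           # hypothetical initial value
out = gdla_head(x, w_q, w_k, w_v, w_g, lam)
print(out.shape)  # (16, 8)
```

Because each branch contracts $\phi(\boldsymbol{K}_i)^{\intercal}\boldsymbol{V}$ before touching the queries, the cost of this sketch grows linearly with the number of tokens; the subtraction, normalization, and gate are all element-wise and do not change that scaling.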