
XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation

Xinyu Liu, Qing Xu, Zhen Chen

Abstract

In the field of Large Language Models (LLMs), Attention Residuals have recently demonstrated that learned, selective aggregation over all preceding layer outputs can outperform fixed residual connections. We propose Cross-Stage Attention Residuals (XAttnRes), a mechanism that maintains a global feature history pool accumulating both encoder and decoder stage outputs. Through lightweight pseudo-query attention, each stage selectively aggregates from all preceding representations. To bridge the gap between the same-dimensional Transformer layers in LLMs and the multi-scale encoder-decoder stages in segmentation networks, XAttnRes introduces spatial alignment and channel projection steps that handle cross-resolution features with negligible overhead. When added to existing segmentation networks, XAttnRes consistently improves performance across four datasets and three imaging modalities. We further observe that XAttnRes alone, even without skip connections, achieves performance on par with the baseline, suggesting that learned aggregation can recover the inter-stage information flow traditionally provided by predetermined connections.
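To make the mechanism concrete, here is a minimal PyTorch sketch of a cross-stage attention residual as described above. It is a hypothetical rendering, not the authors' implementation: the module name, the per-entry 1×1 projections, the spatial mean pooling used to turn each history entry into a single attention logit, and the additive residual are all our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XAttnResBlock(nn.Module):
    """Sketch of a cross-stage attention residual (assumptions noted above).

    history_channels lists the channel width of each feature the history
    pool may contain, in pool order; out_channels is the width of the
    stage this block feeds.
    """

    def __init__(self, history_channels, out_channels):
        super().__init__()
        # Pseudo-query: a single learnable vector scores each history entry.
        self.w = nn.Parameter(torch.zeros(out_channels))
        # Channel projection: one 1x1 conv per potential pool entry.
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in history_channels
        )

    def forward(self, current, history):
        # current: (B, C, H, W); history: causally growing list of earlier
        # stage outputs (a prefix of the widths given at construction).
        B, C, H, W = current.shape
        aligned = []
        for feat, proj in zip(history, self.proj):
            # Spatial alignment to the current stage's resolution.
            feat = F.interpolate(feat, size=(H, W), mode="bilinear",
                                 align_corners=False)
            aligned.append(proj(feat))              # channel projection to C
        values = torch.stack(aligned, dim=1)        # (B, S, C, H, W)
        # One logit per history entry: dot product with w, averaged over
        # space (the pooling choice is an assumption).
        logits = torch.einsum("bschw,c->bs", values, self.w) / (H * W)
        attn = logits.softmax(dim=1)                # weights over S entries
        # Weighted sum of the aligned values, added as a residual.
        return current + torch.einsum("bs,bschw->bchw", attn, values)
```

A toy check of the shape handling: a stage at 32×32 with 64 channels can read a pool of three earlier outputs at different scales and widths.

```python
pool = [torch.randn(2, 32, 128, 128),
        torch.randn(2, 64, 64, 64),
        torch.randn(2, 128, 32, 32)]
block = XAttnResBlock(history_channels=[32, 64, 128], out_channels=64)
y = block(torch.randn(2, 64, 32, 32), pool)  # -> torch.Size([2, 64, 32, 32])
```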

Figures (3)

  • Figure 1: Effect of XAttnRes across two backbones (U-Net and EMCAD) and two benchmarks (Synapse multi-organ CT and ColonDB polyp segmentation). Adding XAttnRes on top of the existing skip connections ("XAttnRes + skip") consistently improves over the baseline. Removing skip connections ("No Skip") degrades performance, but XAttnRes alone ("replace") recovers most of this drop. Dashed lines indicate the baseline.
  • Figure 2: Architecture overview. (a) Standard U-Net with fixed skip connections between resolution-matched encoder and decoder stages. (b) U-Net with XAttnRes (replace): skip connections are entirely removed. Each stage reads from a causally growing history pool ($e_1, \ldots, e_S$ for encoder; $e_1, \ldots, e_S, d_1, \ldots$ for decoder) via lightweight pseudo-query attention, and appends its output for subsequent stages (a wiring sketch of this configuration follows this list). The XAttnRes detail (right) shows how it aligns multi-scale features, computes attention logits via a single learnable vector $\mathbf{w}$, and outputs a weighted sum of the original values.
  • Figure 3: Qualitative comparison across four datasets. Each row shows one dataset (Synapse, ColonDB, ClinicDB, ISIC 2017). Columns from left to right: UNet 3+, UCTransNet, U-Net, U-Net + XAttnRes (ours), EMCAD, EMCAD + XAttnRes (ours), and ground truth.
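As referenced in the Figure 2 caption, the sketch below illustrates how the "replace" wiring of panel (b) might look under the same assumptions as the block above: skip connections are gone, every stage appends its output to a causally growing pool, and each stage after the first reads the whole pool through its own attention-residual block. All names here are hypothetical.

```python
def unet_replace_forward(encoder_stages, decoder_stages, xattn_blocks, x):
    # Hypothetical "replace" wiring (Figure 2b): no skip concatenation.
    pool = []                            # shared feature history
    blocks = iter(xattn_blocks)          # one block per reading stage
    for stage in encoder_stages:
        x = stage(x)
        if pool:                         # the first stage has no history
            x = next(blocks)(x, pool)
        pool.append(x)                   # pool grows to e_1, ..., e_S
    for stage in decoder_stages:
        x = stage(x)
        x = next(blocks)(x, pool)        # reads e_1..e_S, d_1, ...
        pool.append(x)
    return x
```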