Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction

Tao Chen, Chenhui Wang, Zhihao Chen, Hongming Shan

TL;DR

AR-Seg tackles medical image segmentation by explicitly modeling inter-scale dependencies through a next-scale autoregressive framework. It introduces a multi-scale mask autoencoder to produce hierarchical token maps, a next-scale autoregressive segmentor that conditions on all previously predicted scales and an image embedding, and a consensus-aggregation strategy to fuse multiple samples. The segmentation likelihood is modeled as $p(r_1, \cdots, r_K)=\prod_{k=1}^K p_\theta(r_k \mid r_1, \cdots, r_{k-1}, c, f)$, and experiments on LIDC-IDRI and BRATS 2021 demonstrate state-of-the-art performance with explicit coarse-to-fine visualization. This work improves robustness in anatomically variable regions and offers interpretable, progressive segmentation results that can support clinical decision-making.
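To make the factorization concrete, the sketch below evaluates the joint log-likelihood as a sum of per-scale terms, $\log p(r_1, \cdots, r_K) = \sum_{k=1}^{K} \log p_\theta(r_k \mid r_1, \cdots, r_{k-1}, c, f)$. It is only an illustration under stated assumptions, not the authors' implementation: the function name `joint_log_likelihood`, the array shapes, and the idea that the model emits one logit vector per token position are all hypothetical, and tokens within a scale are assumed conditionally independent given the earlier scales and the image.

```python
import numpy as np


def joint_log_likelihood(per_scale_logits, token_maps):
    """Sum log p(r_k | r_1..r_{k-1}, c, f) over the scales k = 1..K.

    per_scale_logits: list of float arrays of shape (h_k, w_k, V).
        Hypothetical model outputs for scale k, assumed to be already
        conditioned on all earlier scales and on the image.
    token_maps: list of int arrays of shape (h_k, w_k) holding the
        token indices r_k for each scale.
    """
    total = 0.0
    for logits, tokens in zip(per_scale_logits, token_maps):
        # Numerically stable log-softmax over the vocabulary axis.
        shifted = logits - logits.max(axis=-1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
        # Log-probability of the observed token at every position.
        picked = np.take_along_axis(log_probs, tokens[..., None], axis=-1)
        # Assumption: positions within one scale are independent given
        # the conditioning, so their log-probabilities add up.
        total += picked.sum()
    return float(total)


# Toy example: two scales (1x1 and 2x2), vocabulary of 4 tokens,
# uniform logits, so every token has probability 1/4.
toy_logits = [np.zeros((1, 1, 4)), np.zeros((2, 2, 4))]
toy_tokens = [np.zeros((1, 1), dtype=np.int64), np.ones((2, 2), dtype=np.int64)]
toy_ll = joint_log_likelihood(toy_logits, toy_tokens)
```

With uniform logits, the five tokens across both scales each contribute $\log(1/4)$, so the total equals $5 \log(1/4)$.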

Abstract

While deep learning has significantly advanced medical image segmentation, most existing methods still struggle with handling complex anatomical regions. Cascaded or deep supervision-based approaches attempt to address this challenge through multi-scale feature learning but fail to establish sufficient inter-scale dependencies, as each scale relies solely on the features of the immediate predecessor. To this end, we propose the AutoRegressive Segmentation framework via next-scale mask prediction, termed AR-Seg, which progressively predicts the next-scale mask by explicitly modeling dependencies across all previous scales within a unified architecture. AR-Seg introduces three innovations: (1) a multi-scale mask autoencoder that quantizes the mask into multi-scale token maps to capture hierarchical anatomical structures, (2) a next-scale autoregressive mechanism that progressively predicts next-scale masks to enable sufficient inter-scale dependencies, and (3) a consensus-aggregation strategy that combines multiple sampled results to generate a more accurate mask, further improving segmentation robustness. Extensive experimental results on two benchmark datasets with different modalities demonstrate that AR-Seg outperforms state-of-the-art methods while explicitly visualizing the intermediate coarse-to-fine segmentation process.
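The abstract does not spell out how the consensus-aggregation strategy fuses the sampled results. As a rough illustration of the general idea only, the sketch below uses pixel-wise majority voting over several sampled binary masks. Majority voting is a stand-in chosen for simplicity, and the function name `majority_vote` and its `threshold` parameter are hypothetical; the paper's actual aggregation rule may differ.

```python
import numpy as np


def majority_vote(masks, threshold=0.5):
    """Fuse sampled binary masks by pixel-wise voting.

    masks: sequence of arrays with identical shape and values in {0, 1}.
    threshold: fraction of samples that must mark a pixel as foreground
        (strictly greater than) for it to be kept. Hypothetical parameter.
    """
    stacked = np.stack(masks, axis=0).astype(np.float64)
    # Fraction of samples that label each pixel as foreground.
    agreement = stacked.mean(axis=0)
    return (agreement > threshold).astype(np.uint8)


# Toy example: three sampled 2x2 masks.
samples = [
    np.array([[1, 0], [1, 1]]),
    np.array([[1, 0], [0, 1]]),
    np.array([[0, 0], [1, 1]]),
]
fused = majority_vote(samples)
```

In this toy case a pixel is kept when at least two of the three samples agree on foreground, which smooths out disagreements between individual samples.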

Paper Structure

This paper contains 10 sections, 5 equations, 7 figures, 2 tables, 2 algorithms.

Figures (7)

  • Figure 1: Illustration of the proposed AR-Seg.
  • Figure 2: Results on LIDC-IDRI.
  • Figure 3: Results on BRATS 2021.
  • Figure 4: Qualitative results of two lung nodules from LIDC-IDRI. $\boldsymbol{y}^{i}$ and $\bar{\boldsymbol{y}}$ refer to the $i$-th sampled segmentation mask and the final consensus-aggregated mask, respectively.
  • Figure 5: Qualitative results of four MRI images from BRATS 2021. Only T1-weighted images are shown for convenience.
  • ...and 2 more figures