Towards a Benchmark for Colorectal Cancer Segmentation in Endorectal Ultrasound Videos: Dataset and Model Development

Yuncheng Jiang; Yiwen Hu; Zixun Zhang; Jun Wei; Chun-Mei Feng; Xuemei Tang; Xiang Wan; Yong Liu; Shuguang Cui; Zhen Li

TL;DR

This paper tackles the challenge of colorectal cancer segmentation in endorectal ultrasound videos by introducing ERUS-10K, the first large-scale, well-annotated ERUS dataset with 77 videos and 10,000 frames (including 57 with infiltration-depth labels), and a novel Adaptive Sparse-context Transformer (ASTR). ASTR combines Adaptive Scanning Mode Augmentation (ASMA) to harmonize linear- and convex-array scans, a Sparse-context Transformer to fuse inter-frame information, and a Sparse-context Block to reduce computation, achieving state-of-the-art segmentation performance with a Dice score of 77.6% on the benchmark. The work provides a valuable dataset and a specialized model that jointly address modality gaps and temporal context, enabling more reliable automatic CRC diagnosis and depth staging from ERUS. Overall, the dataset and model establish a practical benchmark and a path toward clinical deployment of automated ERUS-based CRC analysis.

Abstract

Endorectal ultrasound (ERUS) is an important imaging modality that is highly reliable for diagnosing the depth and boundary of invasion in colorectal cancer. However, the lack of a large-scale ERUS dataset with high-quality annotations hinders the development of automatic ultrasound diagnostics. In this paper, we collect and annotate the first benchmark dataset that covers diverse ERUS scenarios, i.e., colorectal cancer segmentation, detection, and infiltration-depth staging. Our ERUS-10K dataset comprises 77 videos and 10,000 high-resolution annotated frames. Based on this dataset, we further introduce a benchmark model for colorectal cancer segmentation, named the Adaptive Sparse-context TRansformer (ASTR). ASTR is designed around three considerations: scanning-mode discrepancy, temporal information, and low computational complexity. To generalize across scanning modes, adaptive scanning-mode augmentation converts between raw sector images and linear-scan images. To mine temporal information, the sparse-context transformer integrates inter-frame local and global features. To reduce computational complexity, the sparse-context block extracts contextual features from auxiliary frames. On the benchmark dataset, the proposed ASTR model achieves a 77.6% Dice score in rectal cancer segmentation, substantially outperforming previous state-of-the-art methods.
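For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Dice score between a predicted and a ground-truth binary mask is commonly computed. The function name `dice_score` and the empty-mask convention are assumptions for illustration; the abstract does not state the paper's exact evaluation protocol (for example, per-frame versus per-video averaging).

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks.

    Illustrative helper only; not taken from the paper's code.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / total)

# Example: one overlapping pixel, 2 predicted and 1 true foreground pixels.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
score = dice_score(pred, target)  # 2 * 1 / (2 + 1)
```

A score of 1.0 means the masks coincide and 0.0 means they do not overlap, so a 77.6% Dice score corresponds to a value of 0.776 on this scale.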

Paper Structure (10 sections, 5 equations, 9 figures, 1 table)

This paper contains 10 sections, 5 equations, 9 figures, 1 table.

Introduction
Method
Adaptive Scanning Mode Augmentation
Sparse-context Transformer
Loss Function
Experiment
Comparisons with State-of-the-arts
Ablation Study
Conclusion
Supplementary Material

Figures (9)

Figure 1: (a) Schematic diagram of ERUS operation. (b) Different scanning modes of ultrasound. (c) Examples of our ultrasound video dataset with corresponding labels.
Figure 2: Schematic illustration of the adaptive scanning mode augmentation (ASMA). A frame acquired in linear-array (or convex-array) mode is transformed into a frame of the other mode by a polar-Cartesian coordinate transformation, improving the model's generalization across scanning modes.
Figure 3: Pipeline of the proposed ASTR. To generalize to different scanning modes, we first conduct data augmentation by interconverting the linear-array mode and convex-array mode in the adaptive scanning mode augmentation (ASMA). During training, the Sparse-context Transformer extracts inter-frame contexts to exploit spatiotemporal information. Furthermore, we devise a Sparse-context Block to eliminate the irrelevant background noise and reduce computational cost. Finally, the multi-frame contexts from all samples are fused for segmentation mask prediction.
Figure 4: Dataset statistics. (a) Gender distribution. (b) Age distribution. (c) Lesion size distribution.
Figure 5: Qualitative comparisons. "GT" denotes the ground truth. See Suppl. for more visualization results.
...and 4 more figures
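The polar-Cartesian conversion mentioned in the Figure 2 caption can be illustrated with a rough sketch that warps a rectangular (linear-array-like) frame into a fan-shaped (convex-array-like) one. This is not the authors' ASMA implementation: the function name `linear_to_sector`, the parameters `half_angle_deg` and `r_min_frac`, and the nearest-neighbour sampling are all assumptions chosen for illustration, since the page does not give the actual transform geometry.

```python
import numpy as np

def linear_to_sector(frame, half_angle_deg=45.0, r_min_frac=0.2):
    """Warp a rectangular frame (rows = depth, cols = lateral position)
    into a fan whose apex sits at the top centre of the output image.

    Hypothetical geometry: columns map linearly to beam angle and rows
    map to radius, starting at r_min = r_min_frac * height.
    """
    h, w = frame.shape[:2]
    half_angle = np.deg2rad(half_angle_deg)
    r_min = r_min_frac * h
    r_max = r_min + (h - 1)

    out_h = int(np.ceil(r_max)) + 1
    out_w = int(np.ceil(2.0 * r_max * np.sin(half_angle))) + 1

    # Cartesian coordinates of every output pixel relative to the apex.
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    dx = xs - (out_w - 1) / 2.0
    dy = ys.astype(float)

    # Polar coordinates: radius and angle from the downward axis.
    r = np.hypot(dx, dy)
    theta = np.arctan2(dx, dy)

    # Inverse mapping from polar coordinates to source row/column.
    src_row = r - r_min
    src_col = (theta + half_angle) / (2.0 * half_angle) * (w - 1)
    valid = (src_row >= 0) & (src_row <= h - 1) & (np.abs(theta) <= half_angle)

    # Nearest-neighbour sampling keeps label masks binary after warping.
    rows = np.clip(np.rint(src_row), 0, h - 1).astype(int)
    cols = np.clip(np.rint(src_col), 0, w - 1).astype(int)

    out = np.zeros((out_h, out_w) + frame.shape[2:], dtype=frame.dtype)
    out[valid] = frame[rows[valid], cols[valid]]
    return out

# Example: a synthetic frame whose pixel value equals its depth (row) index.
depth_frame = np.tile(np.arange(64, dtype=np.uint8)[:, None], (1, 32))
sector = linear_to_sector(depth_frame)
```

In an augmentation setting the same warp would presumably be applied to the segmentation mask so that image and label stay aligned; the reverse direction (sector to rectangle) follows by swapping the roles of the two grids.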
