Towards a Benchmark for Colorectal Cancer Segmentation in Endorectal Ultrasound Videos: Dataset and Model Development
Yuncheng Jiang, Yiwen Hu, Zixun Zhang, Jun Wei, Chun-Mei Feng, Xuemei Tang, Xiang Wan, Yong Liu, Shuguang Cui, Zhen Li
TL;DR
This paper addresses colorectal cancer (CRC) segmentation in endorectal ultrasound (ERUS) videos with two contributions. The first is ERUS-10K, the first large-scale, well-annotated ERUS dataset, with 77 videos and 10,000 frames (including 57 with infiltration-depth labels). The second is the Adaptive Sparse-context Transformer (ASTR), which combines Adaptive Scanning Mode Augmentation (ASMA) to harmonize linear- and convex-array scans, a Sparse-context Transformer to fuse inter-frame information, and a Sparse-context Block to reduce computation. ASTR achieves a Dice score of 77.6% on the benchmark, which the authors report as state of the art. Together, the dataset and model address the scanning-mode gap and temporal context, establishing a practical benchmark for automatic CRC diagnosis and depth staging from ERUS and a step toward clinical deployment.
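To make the scanning-mode idea concrete, the sketch below resamples a sector (convex-array) image onto a rectangular grid whose rows are depths and whose columns are beam angles, which is the layout of a linear scan. It is a generic nearest-neighbor polar resampling written for illustration, not the authors' ASMA implementation; the function name `sector_to_linear` and all of its parameters (`apex_xy`, `r_min`, `r_max`, `half_angle`, `out_shape`) are assumptions.

```python
import numpy as np

def sector_to_linear(sector, apex_xy, r_min, r_max, half_angle, out_shape):
    """Resample a sector image onto a (depth, angle) grid.

    Hypothetical helper: assumes the probe apex sits at pixel apex_xy = (x, y),
    the beam axis points down the image (+y), and angles are in radians.
    """
    h_out, w_out = out_shape
    cx, cy = apex_xy
    # Each output row is one depth (radius); each output column is one beam angle.
    radii = np.linspace(r_min, r_max, h_out)
    angles = np.linspace(-half_angle, half_angle, w_out)
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    # Polar -> Cartesian source coordinates, rounded to the nearest pixel.
    xs = np.rint(cx + rr * np.sin(aa)).astype(int)
    ys = np.rint(cy + rr * np.cos(aa)).astype(int)
    # Samples that fall outside the source image stay at zero.
    valid = (xs >= 0) & (xs < sector.shape[1]) & (ys >= 0) & (ys < sector.shape[0])
    linear = np.zeros(out_shape, dtype=sector.dtype)
    linear[valid] = sector[ys[valid], xs[valid]]
    return linear
```

Applying the same mapping to the segmentation mask keeps image and label aligned. The inverse direction (linear to sector) would need the opposite coordinate transform and is not shown here.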
Abstract
Endorectal ultrasound (ERUS) is an important imaging modality that provides high reliability for diagnosing the depth and boundary of invasion in colorectal cancer. However, the lack of a large-scale ERUS dataset with high-quality annotations hinders the development of automatic ultrasound diagnostics. In this paper, we collect and annotate the first benchmark dataset that covers diverse ERUS scenarios, i.e., colorectal cancer segmentation, detection, and infiltration depth staging. Our ERUS-10K dataset comprises 77 videos and 10,000 high-resolution annotated frames. Based on this dataset, we further introduce a benchmark model for colorectal cancer segmentation, named the Adaptive Sparse-context TRansformer (ASTR). ASTR is designed around three considerations: scanning mode discrepancy, temporal information, and low computational complexity. To generalize across scanning modes, adaptive scanning-mode augmentation is proposed to convert between raw sector images and linear-scan images. To mine temporal information, the sparse-context transformer is incorporated to integrate inter-frame local and global features. To reduce computational complexity, the sparse-context block is introduced to extract contextual features from auxiliary frames. Finally, on the benchmark dataset, the proposed ASTR model achieves a 77.6% Dice score in rectal cancer segmentation, substantially outperforming previous state-of-the-art methods.
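For readers unfamiliar with the metric, the Dice score measures the overlap between a predicted mask P and a ground-truth mask T as 2|P ∩ T| / (|P| + |T|). The sketch below computes it for a single pair of binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the paper's own evaluation protocol (for example, how scores are averaged over frames or videos) may differ.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / total)
```

A score of 1.0 means the masks coincide exactly and 0.0 means they do not overlap at all.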
