Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy
Jorge Tapias Gomez, Despoina Kanata, Aneesh Rangnekar, Christina Lee, Julio Garcia-Aguilar, Joshua Jesse Smith, Harini Veeraraghavan
TL;DR
The paper introduces SSDCA, a registration-free Siamese Swin Transformer with dual cross-attention that fuses restaging and follow-up endoscopic images to detect local regrowth (LR) of rectal cancer during watch-and-wait (WW) surveillance. By combining cross-temporal attention with pretrained encoders, SSDCA achieves the highest balanced accuracy, sensitivity, and specificity among the Swin-based models compared, remains robust to common endoscopic artifacts, and produces interpretable attention maps that reveal spatial correspondences across timepoints. UMAP clustering of the extracted features indicates discriminative feature learning, and performance holds under realistic data variability, making SSDCA a practical tool for early LR detection in WW protocols. Overall, SSDCA provides a registration-free, temporally aware method that approaches surgeon-level accuracy and could guide timely clinical decisions in rectal cancer management.
Abstract
Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objective and accurate methods for early detection of local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) that combines longitudinal endoscopic images acquired at restaging and follow-up to distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain-agnostic features and enhance robustness to imaging variations. Dual cross-attention emphasizes features from the two images without requiring any spatial alignment between them to predict response. SSDCA and Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76\% $\pm$ 0.04), sensitivity (90.07\% $\pm$ 0.08), and specificity (72.86\% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning.
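To make the dual cross-attention idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each image has already been encoded into a set of patch tokens (random arrays stand in for Swin features here), omits the learned query/key/value projections and any positional encoding, and uses mean pooling plus concatenation as a hypothetical fusion step. The function names, token count (49), and feature width (32) are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    # Scaled dot-product attention where queries come from one image
    # and keys/values come from the other. Learned projections are
    # omitted (identity), which is a simplification for illustration.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)  # (n_queries, n_context)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ context, weights


def dual_cross_attention(restaging_tokens, followup_tokens):
    # Direction 1: restaging tokens attend to follow-up tokens.
    r_attended, r_to_f = cross_attention(restaging_tokens, followup_tokens)
    # Direction 2: follow-up tokens attend to restaging tokens.
    f_attended, f_to_r = cross_attention(followup_tokens, restaging_tokens)
    # Hypothetical fusion: mean-pool each stream and concatenate.
    fused = np.concatenate([r_attended.mean(axis=0), f_attended.mean(axis=0)])
    return fused, r_to_f, f_to_r


# Stand-in features: 49 patch tokens of width 32 per image (assumed sizes).
rng = np.random.default_rng(0)
restaging = rng.normal(size=(49, 32))
followup = rng.normal(size=(49, 32))

fused, r_to_f, f_to_r = dual_cross_attention(restaging, followup)

# Shuffling the follow-up tokens changes their spatial order only.
perm = rng.permutation(followup.shape[0])
fused_shuffled, _, _ = dual_cross_attention(restaging, followup[perm])
order_invariant = bool(np.allclose(fused, fused_shuffled))
```

In this stripped-down setting the fused vector does not depend on the order of the follow-up tokens, which gives some intuition for why attention-based fusion can work without registering the two images. A real model with positional encodings and learned projections would not be exactly order-invariant. The attention matrices `r_to_f` and `f_to_r` are the kind of quantity one could visualize as cross-timepoint correspondence maps.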
