Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy

Jorge Tapias Gomez, Despoina Kanata, Aneesh Rangnekar, Christina Lee, Julio Garcia-Aguilar, Joshua Jesse Smith, Harini Veeraraghavan

TL;DR

The paper introduces SSDCA, a registration-free, dual cross-attention Siamese Swin Transformer framework that fuses restaging and follow-up endoscopic images to detect rectal cancer local regrowth (LR) during watch-and-wait (WW) surveillance. By leveraging cross-temporal attention and pretrained encoders, SSDCA achieves the best balanced accuracy, sensitivity, and specificity among the compared Swin-based baselines, with robustness to common endoscopic artifacts and interpretable attention maps that reveal spatial correspondences across timepoints. The approach demonstrates discriminative feature learning via UMAP clustering and maintains performance under realistic data variability, offering a practical tool for early LR detection in WW protocols. Overall, SSDCA provides a registration-free, temporally aware method that approaches surgeon-level accuracy and could guide timely clinical decisions in rectal cancer management.

Abstract

Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objective and accurate methods for early detection of local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain-agnostic features and enhance robustness to imaging variations. Dual cross-attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA and the Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% $\pm$ 0.04), sensitivity (90.07% $\pm$ 0.08), and specificity (72.86% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning.
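To make the idea of dual cross-attention concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each image has already been encoded into a list of token feature vectors, uses single-head scaled dot-product attention with identity projections (no learned weights), and fuses the two attended streams by mean pooling and concatenation. The function names, the toy token values, and the fusion step are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over all tokens of the other image.

    Assumption: identity Q/K/V projections and a single head.
    Returns the attended tokens and the attention weights.
    """
    dim = len(queries[0])
    attended, weights = [], []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        w = softmax(scores)
        weights.append(w)
        attended.append(
            [sum(wj * tok[i] for wj, tok in zip(w, context)) for i in range(dim)]
        )
    return attended, weights


def mean_pool(tokens):
    """Average token vectors into one feature vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def dual_cross_attention(restaging, followup):
    """Run attention in both directions and fuse the results.

    Assumption: fusion is mean pooling followed by concatenation.
    """
    r2f, w_r2f = cross_attention(restaging, followup)  # restaging queries follow-up
    f2r, w_f2r = cross_attention(followup, restaging)  # follow-up queries restaging
    fused = mean_pool(r2f) + mean_pool(f2r)  # list concatenation
    return fused, w_r2f, w_f2r


# Toy tokens (hypothetical values); the two images may have different token counts.
restaging_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
followup_tokens = [[0.5, 0.5], [2.0, 0.0]]

fused, w_r2f, w_f2r = dual_cross_attention(restaging_tokens, followup_tokens)
print(len(fused), len(w_r2f), len(w_f2r))
```

Because every token in one image is compared against every token in the other, the sketch needs no pixel-wise correspondence between the two images, and in this simplified form the fused vector does not change if the tokens of either image are reordered. A real model would add learned projections, multiple heads, positional information, and a classification head on top of the fused features.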

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: (a) Endoscopic images acquired every three months during watch-and-wait, producing a longitudinal sequence from $t_0$ to $t_n$. (b) Training and testing setup in which image pairs from different timepoints are combined and provided as input into the Siamese Swin Dual Cross-Attention (SSDCA) model.
  • Figure 2: Architecture overview of the model.
  • Figure 3: UMAP visualization of feature embeddings from the testing set. Each point represents an image combination (or a single image for Swin-S SI), color-coded by final clinical outcome. LR = local regrowth, cCR = clinical complete response.
  • Figure 4: GradCAM and attention maps for two representative test cases produced using SSDCA show good correspondence of relevant spatial features between the images.
  • Figure 5: SSDCA output probability distributions under different imaging variations: blood, stool, telangiectasia (TLG), and poor image quality (PQ). Green represents correctly predicted samples, and red represents incorrectly predicted samples.