FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders

Soumen Basu, Mayuna Gupta, Chetan Madan, Pankaj Gupta, Chetan Arora

TL;DR

This work reframes gallbladder cancer (GBC) detection from ultrasound (US) as a video-based task in order to exploit spatiotemporal cues. It introduces FocusMAE, a masked autoencoder guided by region priors: token masking is biased toward high-information regions identified by region proposals, combined with 3D tokenization and a transformer-based encoder/decoder. On a newly compiled US video dataset, FocusMAE achieves 96.4% accuracy with perfect sensitivity, surpassing both the image-based SOTA and existing video baselines. On a public CT dataset it also improves COVID detection accuracy by 3.3% over current baselines, the figure reported in the abstract. These results indicate that the method generalizes across modalities and diseases, suggesting broader applicability to medical video representation learning.

Abstract

In recent years, automated Gallbladder Cancer (GBC) detection has gained the attention of researchers. Current state-of-the-art (SOTA) methodologies relying on ultrasound sonography (US) images exhibit limited generalization, emphasizing the need for transformative approaches. We observe that individual US frames may lack sufficient information to capture disease manifestation. This study advocates for a paradigm shift towards video-based GBC detection, leveraging the inherent advantages of spatiotemporal representations. Employing the Masked Autoencoder (MAE) for representation learning, we address shortcomings in conventional image-based methods. We propose a novel design called FocusMAE to systematically bias the selection of masking tokens from high-information regions, fostering a more refined representation of malignancy. Additionally, we contribute the most extensive US video dataset for GBC detection. We also note that this is the first study on US video-based GBC detection. We validate the proposed methods on the curated dataset and report a new SOTA accuracy of 96.4% for the GBC detection problem, compared with 84% for the current image-based SOTA methods (GBCNet and RadFormer) and 94.7% for the video-based SOTA (AdaMAE). We further demonstrate the generality of the proposed FocusMAE on a public CT-based Covid detection dataset, reporting an improvement in accuracy of 3.3% over current baselines. The source code and pretrained models are available at: https://gbc-iitd.github.io/focusmae
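
The idea of biasing the masking tokens toward high-information regions can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single rectangular candidate region given in token coordinates, and it borrows the inflation factor $\pi$ described in the Figure 1(b) caption as a simple multiplicative weight on the sampling probability of tokens inside that region. The function name `focused_mask`, the box format, and the default values of `mask_ratio` and `pi` are hypothetical choices made for illustration.

```python
import numpy as np


def focused_mask(grid_thw, box, mask_ratio=0.8, pi=3.0, seed=0):
    """Sample a boolean token mask biased toward a candidate region.

    grid_thw   -- (T, H, W) size of the token grid after 3D tokenization
    box        -- (y0, x0, y1, x1) region prior in token coordinates,
                  end-exclusive (hypothetical format, for illustration)
    mask_ratio -- fraction of all tokens to mask (illustrative value)
    pi         -- factor by which in-region sampling weights are inflated
    """
    t, h, w = grid_thw
    y0, x0, y1, x1 = box

    # Start from uniform weights (plain random masking), then inflate
    # the weights of tokens that fall spatially inside the region.
    weights = np.ones((t, h, w), dtype=np.float64)
    weights[:, y0:y1, x0:x1] *= pi
    probs = (weights / weights.sum()).ravel()

    # Draw the masked token indices without replacement.
    n_mask = int(round(mask_ratio * probs.size))
    rng = np.random.default_rng(seed)
    idx = rng.choice(probs.size, size=n_mask, replace=False, p=probs)

    mask = np.zeros(probs.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(t, h, w)


if __name__ == "__main__":
    mask = focused_mask((8, 14, 14), box=(4, 4, 10, 10))
    inside = mask[:, 4:10, 4:10].mean()
    outside = (mask.sum() - mask[:, 4:10, 4:10].sum()) / (mask.size - 8 * 36)
    print(f"masked inside region: {inside:.2f}, outside: {outside:.2f}")
```

With `pi=1.0` the sketch reduces to uniform random masking. Larger values concentrate masked tokens in the candidate region, which matches the trade-off noted in the Figure 1(b) caption: moderate inflation helps, while masking the object region excessively degrades performance.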

Paper Structure (21 sections, 4 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: (a) Masking strategy of FocusMAE in comparison to existing random patch [maest], frame [wei2022masked], and tube [videomae] masking. Our approach selects more tokens from the semantically meaningful regions with a small number of background tokens for masking. (b) Inflating the masking probability of the tokens which spatially lie within the object region (gray region) by $\pi$ increases the accuracy. However, excessive masking of the object region degrades performance. Blue line: accuracy of the original random masking.
  • Figure 2: Overview of the proposed FocusMAE pipeline. Our design guides the selection of masking tokens using the localization of candidate focus regions that contain high information. This systematic biasing with focused high-information region priors helps to build a more meaningful reconstruction task for disease representation learning.
  • Figure 3: Sample video sequences from our US video dataset used for GBC detection, and the public COVID-CT-MD dataset [covidctmd]. We show samples of both malignant and benign (non-malignant) sequences for GBC data. For the Covid data, we show sample sequences for both Covid and non-Covid categories.
  • Figure 4: Visual demonstration of the benefit of using the FocusMAE method. (a) Original frames from a US video sequence exhibiting GB malignancy. ROI is drawn in red. (b) Candidate regions as prior (in yellow). (c) Masking by FocusMAE. (d), (e) Attention visualization for the downstream malignancy detection for VideoMAE and FocusMAE, respectively. For FocusMAE, the attention is well guided to the key regions containing the malignancy, as opposed to VideoMAE. (f) CT Slices of a sample Covid patient. (g) Attention visualization of FocusMAE.
  • Figure 5: Ablation study. We report the mean scores over 5-fold cross-validation for GBC detection. (a) Effect of varying the masking ratio ($\rho$) on accuracy. (b) Effect of varying the reconstruction loss (L1 vs. MSE) for SSL pretraining. Training with MSE yields 2.1% better accuracy. (c) Performance for different backbones. (d) Effect of varying the decoder depth.
  • ...and 4 more figures