Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis

Ruilang Wang, Shuotong Xu, Bowen Liu, Runlin Huang, Donglong Chen, Weifeng Su

TL;DR

Mask What Matters (MWM) tackles the semantic misalignment and inefficiency of random high-ratio masking in medical self-supervised learning (SSL) by introducing a text-guided masking framework. It localizes task-relevant regions of interest (ROIs) from open-vocabulary prompts using a frozen vision-language model (BiomedCLIP), refines them with the Segment Anything Model (SAM), and applies differentiated masking (a high ratio on ROIs and a low ratio on background) within a three-stage pipeline that includes sparse encoding and hierarchical reconstruction. The method is annotation-free and backbone-agnostic, and it yields consistent gains across brain MRI, chest CT, and lung X-ray for classification, detection, and segmentation, while operating at substantially lower masking ratios (e.g., 40% vs. 70%). These results indicate that semantic guidance from natural language prompts can improve cross-task generalization and representation quality in medical image analysis, reducing data and compute demands for SSL pretraining.

Abstract

The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40% vs. 70%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.
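
The differentiated masking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `differentiated_patch_mask`, the 14x14 patch grid, the hand-drawn ROI, and the per-region ratios (0.75 on the ROI, 0.25 on background) are all assumptions chosen for illustration. In the paper the ROI would come from the prompt-based localization step rather than from a hard-coded box.

```python
import numpy as np


def differentiated_patch_mask(roi_patches, roi_ratio=0.75, bg_ratio=0.25, seed=0):
    """Return a boolean patch mask (True = masked).

    roi_patches: 2D boolean array, True where a patch belongs to the ROI.
    roi_ratio / bg_ratio: hypothetical masking ratios for ROI and background.
    """
    rng = np.random.default_rng(seed)
    roi = np.asarray(roi_patches, dtype=bool)
    flat_roi = roi.ravel()
    mask = np.zeros(flat_roi.shape, dtype=bool)

    # Sample masked patches separately inside and outside the ROI.
    for region, ratio in ((flat_roi, roi_ratio), (~flat_roi, bg_ratio)):
        idx = np.flatnonzero(region)
        n_mask = int(round(ratio * idx.size))
        if n_mask > 0:
            chosen = rng.choice(idx, size=n_mask, replace=False)
            mask[chosen] = True

    return mask.reshape(roi.shape)


# Toy example: a 14x14 patch grid with a 6x6 ROI in the middle (assumed layout).
roi = np.zeros((14, 14), dtype=bool)
roi[4:10, 4:10] = True

mask = differentiated_patch_mask(roi)
overall_ratio = mask.mean()  # fraction of all patches that are masked
print(f"ROI masked: {mask[roi].mean():.2f}, "
      f"background masked: {mask[~roi].mean():.2f}, "
      f"overall: {overall_ratio:.2f}")
```

Because the ROI usually covers a small share of the image, the overall masking ratio is dominated by the low background ratio, which is one way a region-aware scheme can end up with a lower overall ratio than uniform random masking while still hiding most of the relevant region.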

Paper Structure

This paper contains 18 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the MWM framework. The top panel shows the three-stage pipeline: (1) text-guided region localization, (2) region-aware masked image modeling, and (3) downstream transfer. The bottom panel zooms in to illustrate how prompts guide region localization and masking.
  • Figure 2: Comparison of masking strategies. Gray blocks indicate masked regions.
  • Figure 3: Representative results of text-guided region localization.
  • Figure 4: Classification performance on the Chest CT dataset under different masking strategies and ratios.