Improving Representation of High-frequency Components for Medical Visual Foundation Models

Yuetan Chu, Yilan Zhang, Zhongyi Han, Changchun Yang, Longxi Zhou, Gongning Luo, Chao Huang, Xin Gao

TL;DR

Medical foundation models often fail to capture high-frequency, fine-grained details essential for accurate diagnosis. Frepa introduces a dual masking framework and histogram-equalized masking to boost high-frequency representations, supported by hierarchical frequency-to-spatial and embedding-consistency losses, and extends from 2D to 3D with a transfer pathway for 2D encoders. Pretrained on 17 million images across 9 modalities and evaluated on 32 downstream tasks, Frepa consistently improves performance, especially on fine-grained segmentation and detection, and generalizes to external modalities. By directly enhancing the preservation of high-frequency information in embeddings, Frepa advances toward more universal and robust medical image foundation models. Code and models are publicly available for replication and extension.

Abstract

Foundation models have recently attracted significant attention for their impressive generalizability across diverse downstream tasks. However, these models have been shown to be severely limited in representing high-frequency components and fine-grained details. In many medical imaging tasks, precise representation of such information is crucial because of the inherently intricate anatomical structures, sub-visual features, and complex boundaries involved. Consequently, the limited representation ability of prevalent foundation models can result in significant performance degradation or even failure in these tasks. To address these challenges, we propose a novel pretraining strategy, named Frequency-advanced Representation Autoencoder (Frepa). Through high-frequency masking and low-frequency perturbation combined with adversarial learning, Frepa encourages the encoder to effectively represent and preserve high-frequency components in the image embeddings. Additionally, we introduce an innovative histogram-equalized image masking strategy, extending the Masked Autoencoder approach beyond ViT to other architectures such as Swin Transformer and convolutional networks. We develop Frepa across nine medical modalities and validate it on 32 downstream tasks for both 2D images and 3D volume data. Without fine-tuning, Frepa can outperform other self-supervised pretraining methods and, in some cases, even surpass task-specific trained models. This improvement is particularly significant for tasks involving fine-grained details, with up to a +15% increase in DSC for retinal vessel segmentation and a +7% increase in IoU for lung nodule detection. Further experiments quantitatively show that Frepa enables superior representation and preservation of high-frequency components in the embeddings, underscoring its potential for developing more generalized and universal medical image foundation models.

Paper Structure

This paper contains 24 sections, 15 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Comparison of high-frequency representation ability. (a) Classification accuracy on the high-pass filtered test set. Models are first trained on the raw images. A high-pass filter is then applied to the test images to progressively remove low-frequency components, and classification accuracy is evaluated on these low-frequency-corrupted images. Other models show a significant drop in accuracy as the filter size increases, indicating limitations in representing high-frequency components, while our method (Frepa) maintains more robust performance. The filtered images are shown at the top. (b) Segmentation results for tasks involving fine-grained details. From top left to bottom right: retinal vessel segmentation, pneumonia segmentation, pulmonary artery segmentation, and lung nodule detection. Even after its decoder is re-trained, the segmentation-specific foundation model MedSAM still performs poorly on fine-grained segmentation tasks.
  • Figure 2: Overall architecture of the proposed Frepa. (a) Frepa pretraining employs two parallel strategies: frequency dual-component masking and histogram-equalized masking. These strategies are applied to the original image with equal probability to obtain the corrupted image. (b) Illustrations of the distance calculation and the exponential decay function used in the frequency dual-component masking. (c) The method for extending the 2D pretrained encoder to 3D volume data.
  • Figure 3: Histograms of images under different masking strategies. Direct zero-masking causes a severe shift in the histogram distribution, whereas our two proposed masking strategies preserve the raw distribution.
  • Figure 4: Comparison of the raw image (a) and the reconstructed image (b) optimized with the FFL loss. The frequency spectrum is shown on the right. The reconstructed image exhibits aliasing artifacts, which may be attributed to overfitting in the frequency domain. The region inside the orange box is zoomed in for better visualization.
  • Figure 5: Example reconstructions on external datasets. The images are corrupted by random masking and low-frequency filtering, respectively. Notably, such low-frequency-filtered images are not seen during the training of Frepa. We visualize both the images and their frequency spectra. RMAE is shown in the upper-left corner of each image. Zoom in for better viewing of the images.
  • ...and 2 more figures
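
The high-pass filtering evaluation described in the Figure 1(a) caption can be illustrated with a short sketch. This is a minimal example that assumes an ideal circular filter in the Fourier domain; the filter shape actually used in the paper and the exact meaning of its filter size are not stated in this summary, so the function name `high_pass_filter` and the parameter `radius` are hypothetical.

```python
import numpy as np


def high_pass_filter(image: np.ndarray, radius: float) -> np.ndarray:
    """Remove low-frequency content from a 2D grayscale image.

    Assumption (not from the paper): an ideal circular mask that zeroes
    every Fourier coefficient within `radius` of the zero frequency.
    """
    h, w = image.shape
    # Move the zero-frequency (DC) coefficient to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Distance of every coefficient from the shifted DC position.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    # Zero the low-frequency disc; only higher frequencies remain.
    spectrum[dist <= radius] = 0
    # Undo the shift and return to the spatial domain. The mask is
    # symmetric, so the imaginary part is only floating-point noise.
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))


# Usage: filter a synthetic image with progressively larger radii,
# mimicking the progressive removal of low-frequency components.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
filtered = [high_pass_filter(image, r) for r in (2, 8, 16)]
```

Increasing `radius` removes a larger share of the spectrum, so a model that depends mainly on low-frequency cues would be expected to lose accuracy faster on such inputs than one that also encodes high-frequency content.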