CoBooM: Codebook Guided Bootstrapping for Medical Image Representation Learning

Azad Singh; Deepak Mishra

TL;DR

CoBooM targets a weakness of self-supervised learning on medical images: existing methods do not exploit the high anatomical similarity between images. It uses a codebook to capture these similarities, combining continuous context/target encodings with discrete codebook entries produced by a Quantizer and fusing the two with DiversiFuse cross-attention in a BYOL-like framework. The approach outperforms multiple SSL baselines on chest X-ray and fundus datasets in both linear probing and semi-supervised settings, with notable gains in classification and segmentation and minimal need for backbone fine-tuning. The result is a scalable, resource-efficient method for learning transferable medical image representations from unlabeled data.

Abstract

Self-supervised learning (SSL) has emerged as a promising paradigm for medical image analysis by harnessing unannotated data. Despite their potential, existing SSL approaches overlook the high anatomical similarity inherent in medical images, which makes it challenging for them to capture diverse semantic content consistently. This work introduces a novel and generalized solution that implicitly exploits anatomical similarities by integrating codebooks into SSL. The codebook serves as a concise and informative dictionary of visual patterns, which not only aids in capturing nuanced anatomical details but also facilitates the creation of robust and generalized feature representations. In this context, we propose CoBooM, a novel framework for self-supervised medical image learning that integrates continuous and discrete representations. The continuous component preserves fine-grained details, while the discrete component facilitates coarse-grained feature extraction through the structured embedding space. To assess the effectiveness of CoBooM, we conduct a comprehensive evaluation on medical datasets encompassing chest X-rays and fundus images. The experimental results reveal a significant performance gain in classification and segmentation tasks.
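
To make the "dictionary of visual patterns" idea concrete, the following is a minimal sketch of a codebook lookup, assuming a generic nearest-neighbour vector quantization step. The abstract does not specify how the CoBooM Quantizer is implemented, so the function names (`squared_distance`, `quantize`), the Euclidean distance, and the toy 2-D vectors are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each continuous feature to its nearest codebook entry.

    Returns the chosen indices (the discrete representation) and the
    corresponding codebook vectors (the quantized features).
    Hypothetical helper: the distance metric is an assumption.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for feat in features:
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feat, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy example: three codebook entries and three continuous features.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.2, -0.1], [0.9, 1.2], [3.5, 0.4]]
indices, quantized = quantize(features, codebook)
```

In this sketch the continuous `features` keep fine-grained detail, while `indices` gives the coarse, discrete view: similar inputs collapse onto the same codebook entry.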

Paper Structure

This paper contains 9 sections, 1 equation, 2 figures, 2 tables.

Introduction
Background
Methodology
Quantizer
DiversiFuse (Feature Fusion with Multi-Head Cross Attention)
Loss Function
Experimental Setup
Results and Discussion
Conclusion

Figures (2)

Figure 1: The architecture overview of the proposed framework. EMA is an exponential moving average used to update the parameters of the Target encoder. $g_\theta$ and $g_\phi$ are the three MLP networks that serve as projection heads for Context and Target encoders. $f'_\theta$ serves as the decoder network.
Figure 2: Diagnostic maps for Atelectasis, Effusion, Cardiomegaly, and Mass on X-ray images from NIH, indicating that CoBooM captures pathological features more effectively than the other best-performing baseline methods. The bounding box indicates the ground truth.
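
The Figure 1 caption notes that the Target encoder's parameters are updated with an exponential moving average (EMA). A minimal sketch of such an update follows, assuming the usual BYOL-style rule in which the target moves slowly toward the context parameters. The function name `ema_update`, the flat list-of-floats parameter layout, and the momentum value are hypothetical; this section does not state the value the paper uses.

```python
from typing import List, Sequence


def ema_update(
    target: Sequence[float],
    context: Sequence[float],
    momentum: float = 0.99,  # assumed value, not taken from the paper
) -> List[float]:
    """Return new target parameters: momentum * target + (1 - momentum) * context."""
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(target, context)]


# Toy example: the target drifts one tenth of the way toward the context.
target_params = [1.0, 0.0]
context_params = [0.0, 1.0]
updated = ema_update(target_params, context_params, momentum=0.9)
```

Because only the context branch receives gradients in this kind of setup, the target acts as a slowly changing, more stable copy of it.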
