Constrained Multiview Representation for Self-supervised Contrastive Learning
Siyuan Dai, Kai Ye, Kun Zhao, Ge Cui, Haoteng Tang, Liang Zhan
TL;DR
This work tackles the challenge of learning robust representations for medical image segmentation under variable lesion distributions by introducing a frequency-domain, multi-view contrastive learning framework guided by mutual information (MI). It generates multiple frequency-domain views by applying an offline discrete cosine transform (DCT) to CT slices, ranks them by their MI with latent representations using a MINE-based estimator, and selects a fixed-cost subset to drive a multi-view contrastive objective (MIMIC) integrated into segmentation backbones. The approach couples continuous MI maximization with view selection and trains with a joint loss comprising BCE, Dice, and contrastive/MI terms. It improves segmentation performance on three COVID-19 CT datasets, and ablations highlight the value of MI-driven view selection. The gains over baselines hold across multiple metrics and under parameter analyses, offering a practical path, also usable in semi-supervised settings, to better representation learning in medical imaging.
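For reference, MINE-style estimators train a critic network to maximize the Donsker-Varadhan lower bound on MI. The block below states that bound and one plausible form of the joint loss. The symbols V (a frequency view), Z (a latent representation), the critic T_θ, and the weights λ₁ and λ₂ are assumptions made for illustration, since this summary does not give the exact formulation or weighting.

```latex
% Donsker-Varadhan lower bound maximised by a MINE-style critic T_\theta
I(V; Z) \;\ge\; \sup_{\theta}\;
  \mathbb{E}_{p(v,z)}\!\left[T_\theta(v,z)\right]
  - \log \mathbb{E}_{p(v)\,p(z)}\!\left[e^{T_\theta(v,z)}\right]

% Assumed weighted form of the joint loss (weights are hypothetical)
\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \mathcal{L}_{\mathrm{Dice}}
  + \lambda_{1}\,\mathcal{L}_{\mathrm{con}} + \lambda_{2}\,\mathcal{L}_{\mathrm{MI}}
```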
Abstract
Representation learning is a cornerstone of contemporary deep learning, offering a means to expose distinctive features in the latent space and to interpret deep models. In medical image segmentation, however, the inherent complexity of anatomical patterns and the random nature of lesion distribution make it difficult to disentangle representations and to identify salient features. Methods guided by mutual information maximization, particularly within the contrastive learning framework, have been notably successful at decoupling densely intertwined representations. However, the effectiveness of contrastive learning depends heavily on the quality of the positive and negative sample pairs: averaging mutual information over unselected views can obstruct learning, so view selection is vital. In this work, we introduce a novel approach built on representation distance-based mutual information (MI) maximization that measures the significance of different views, aiming at more efficient contrastive learning and representation disentanglement. Additionally, we introduce an MI re-ranking strategy for representation selection, which benefits both continuous MI estimation and the measurement of representation significance distance. Specifically, we harness multi-view representations extracted from the frequency domain and re-evaluate their significance based on mutual information across frequencies, thereby enabling a multifaceted contrastive learning approach that strengthens semantic comprehension. Statistical results under five metrics demonstrate that our proposed framework effectively constrains the MI maximization-driven representation selection and steers the multi-view contrastive learning process.
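The idea of building frequency-domain views and keeping only the highest-scoring ones can be illustrated with a minimal sketch. It is not the authors' implementation. All function names are hypothetical, and the radial band partition is an assumption. A simple histogram-based MI estimate between each view and the input slice stands in for the MINE-based estimator over latent representations that the paper describes.

```python
from typing import List, Tuple

import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale
    return mat


def frequency_views(image: np.ndarray, num_bands: int) -> List[np.ndarray]:
    """Split a 2D slice into band-limited views via the 2D DCT.

    Assumption: bands are equal-width rings in normalised radial frequency.
    """
    h, w = image.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ image @ dw.T  # forward 2D DCT
    fy = np.arange(h)[:, None] / h
    fx = np.arange(w)[None, :] / w
    radius = np.sqrt(fy**2 + fx**2) / np.sqrt(2.0)  # in [0, 1)
    band = np.minimum((radius * num_bands).astype(int), num_bands - 1)
    views = []
    for b in range(num_bands):
        masked = np.where(band == b, coeffs, 0.0)  # keep one band only
        views.append(dh.T @ masked @ dw)  # inverse 2D DCT
    return views


def histogram_mi(a: np.ndarray, b: np.ndarray, bins: int = 16) -> float:
    """Plug-in MI estimate (in nats) from a joint histogram.

    Stand-in for a learned MINE critic; used here only for illustration.
    """
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def select_views(
    views: List[np.ndarray], scores: List[float], k: int
) -> Tuple[List[np.ndarray], List[int]]:
    """Keep the k views with the highest scores, best first."""
    order = np.argsort(scores)[::-1][:k].tolist()
    return [views[i] for i in order], order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ct_slice = rng.random((64, 64))  # synthetic stand-in for a CT slice
    views = frequency_views(ct_slice, num_bands=4)
    scores = [histogram_mi(ct_slice, v) for v in views]
    kept, order = select_views(views, scores, k=2)
    print("selected band indices:", order)
```

Because the band masks partition the DCT coefficients, the views sum back to the input slice, which is a convenient sanity check. In a training pipeline the scores would presumably come from the learned estimator and be refreshed as the representations change.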
