Constrained Multiview Representation for Self-supervised Contrastive Learning

Siyuan Dai; Kai Ye; Kun Zhao; Ge Cui; Haoteng Tang; Liang Zhan

TL;DR

This work addresses representation learning for medical image segmentation, where lesion distributions vary widely, by introducing a frequency-domain, multi-view contrastive learning framework guided by mutual information (MI). It generates multiple frequency-domain views by applying an offline DCT to CT slices, ranks the views by their MI with the latent representation using a MINE-based estimator, and selects a fixed-cost subset to drive a multi-view contrastive objective (MIMIC) integrated into segmentation backbones. Training combines continuous MI maximization with view selection under a joint loss of BCE, Dice, and the contrastive/MI terms. The framework improves segmentation performance over baselines on three COVID-19 CT datasets, ablations highlight the value of MI-driven view selection, and the gains hold under parameter analyses, offering a practical path to better representation learning in medical imaging that also extends to semi-supervised settings.
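
The ranking-and-selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it evaluates the Donsker-Varadhan lower bound that MINE optimizes, but with a fixed, hand-written critic (`product_critic`) instead of a trained network, and it uses scalar toy data in place of real view and latent features. The names `dv_lower_bound`, `select_views`, and `product_critic` are hypothetical.

```python
import math


def dv_lower_bound(xs, zs, critic):
    """Donsker-Varadhan lower bound on I(X; Z) for a fixed critic.

    MINE would train the critic; here it is fixed (an assumption of this sketch).
    """
    n = len(xs)
    # Expectation of the critic over paired (joint) samples.
    joint_term = sum(critic(x, z) for x, z in zip(xs, zs)) / n
    # All cross pairs stand in for samples from the product of marginals.
    marginal_term = sum(math.exp(critic(x, z)) for x in xs for z in zs) / (n * n)
    return joint_term - math.log(marginal_term)


def select_views(mi_scores, k):
    """Return the indices of the k views with the highest MI estimate."""
    order = sorted(range(len(mi_scores)), key=lambda i: mi_scores[i], reverse=True)
    return order[:k]


# Toy data: one scalar "latent code" per sample and two candidate views.
latent = [1.0, -1.0, 1.0, -1.0]
views = [
    [1.0, -1.0, 1.0, -1.0],  # view 0: tracks the latent code exactly
    [1.0, 1.0, -1.0, -1.0],  # view 1: uncorrelated with the latent code
]


def product_critic(x, z):
    """Hypothetical fixed critic; a trained network would replace it."""
    return x * z


scores = [dv_lower_bound(view, latent, product_critic) for view in views]
selected = select_views(scores, k=1)
```

With a fixed selection size `k`, the cost of the downstream contrastive objective stays constant no matter how many candidate views the DCT produces, which is one plausible reading of the "fixed-cost subset" mentioned above.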

Abstract

Representation learning is a cornerstone of contemporary deep learning, offering a way to expose distinctive features in the latent space and to interpret deep models. Nevertheless, the inherent complexity of anatomical patterns and the random nature of lesion distribution in medical image segmentation make it difficult to disentangle representations and to identify salient features. Methods guided by the maximization of mutual information, particularly within the framework of contrastive learning, have demonstrated remarkable success in decoupling densely intertwined representations. However, the effectiveness of contrastive learning depends heavily on the quality of the positive and negative sample pairs: averaging mutual information over unselected views can hinder learning, so view selection is vital. In this work, we introduce a novel approach based on representation distance-based mutual information (MI) maximization for measuring the significance of different views, aiming at more efficient contrastive learning and representation disentanglement. Additionally, we introduce an MI re-ranking strategy for representation selection, which benefits both continuous MI estimation and the distance-based measurement of representation significance. Specifically, we harness multi-view representations extracted from the frequency domain and re-evaluate their significance based on mutual information across varying frequencies, thereby enabling a multifaceted contrastive learning approach that strengthens semantic comprehension. Statistical results on five metrics demonstrate that our proposed framework effectively constrains the MI maximization-driven representation selection and steers the multi-view contrastive learning process.
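
To make the dependence on positive and negative pairs concrete, the following sketch computes an InfoNCE-style contrastive loss with cosine similarity. InfoNCE is a standard contrastive objective used here as a stand-in; the paper's exact loss is not given in this summary, and the function names, the temperature value, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: low when the positive is the closest sample."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]

# A well-chosen positive is aligned with the anchor.
loss_good_pair = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])

# A poorly chosen positive is orthogonal, while a "negative" is aligned.
loss_bad_pair = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In this toy setting the poorly chosen pair yields a much larger loss, so its gradient would push the encoder away from a sample that is actually similar to the anchor. That is the failure mode that view selection is meant to avoid.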

Paper Structure (24 sections, 11 equations, 9 figures, 3 tables, 2 algorithms)

Introduction
Related Work
Representation Learning
Mutual Information
Frequency Domain Information
Methodology
Multi-view Data Generation In The Frequency Domain
Constrained Multi-view Mutual Information
Mutual Information
Multi-view Mutual Information Maximization For Feature Selection
Multi-view Contrastive Learning
Single-view Contrastive Loss
Multi-view Contrastive Loss
Segmentation Framework
Implementation Details
...and 9 more sections

Figures (9)

Figure 1: Given a multi-view DCT cube generated from the original image, the deep representation features are re-ranked and selected according to their distance from the original image representation in the information entropy space.
Figure 2: An overview of the offline DCT transformation module for generating the multi-view cube. Every original image slice is partitioned into small image patches after subject-wise normalization. Subsequently, a DCT transformation is applied to the image patches. Finally, the coefficient cube for the whole image is generated by frequency-based flattening ($F^{2}$) and frequency-wise normalization (FN) operations.
Figure 3: The proposed framework introduces a multi-view contrastive learning strategy executed within the latent space, using selected views to refine representation learning. Although cosine similarity is the primary distance metric in our setting, the framework's design permits alternative metrics, such as the L1 norm or L2 norm, depending on the specific requirements. Given a generated multi-view DCT cube, each view is selected based on its proximity to the original image representation within the information entropy space.
Figure 4: Visualization of the segmentation results on COVID19-CT-Seg20 (row 1), COVID19-CT-100 (row 2), and MosMedData (row 3), produced by our proposed MIMIC framework under two backbones and by four other competitive baselines. True positives, false negatives, and false positives are highlighted in red, green, and blue, respectively.
Figure 5: Visualization of the segmentation results in the ablation study on COVID19-CT-100 (row 1), COVID19-CT-Seg20 (row 2), and MosMedData (row 3), produced by our proposed MIMIC framework under two backbones and by their baselines. True positives, false negatives, and false positives are highlighted in red, green, and blue, respectively.
...and 4 more figures
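
The patch-wise DCT pipeline summarized in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions: the caption does not define the flattening and normalization operations precisely, so this sketch assumes that flattening gathers the same frequency coefficient from every patch into one channel, and that frequency-wise normalization standardizes each channel to zero mean and unit variance. The function names, the patch size, and the toy image are hypothetical.

```python
import math


def dct_2d(patch):
    """Orthonormal 2D DCT-II of a square patch given as a list of rows."""
    n = len(patch)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (patch[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def dct_cube(image, p):
    """Split the image into p x p patches and collect one channel per frequency."""
    rows, cols = len(image) // p, len(image[0]) // p
    cube = [[[0.0] * cols for _ in range(rows)] for _ in range(p * p)]
    for r in range(rows):
        for c in range(cols):
            patch = [image[r * p + i][c * p:(c + 1) * p] for i in range(p)]
            coeffs = dct_2d(patch)
            for u in range(p):
                for v in range(p):
                    # Channel u * p + v holds frequency (u, v) of every patch.
                    cube[u * p + v][r][c] = coeffs[u][v]
    return cube


def frequency_normalize(channel, eps=1e-8):
    """Standardize one frequency channel (assumed meaning of FN)."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [[(v - mean) / (std + eps) for v in row] for row in channel]


# Toy 4 x 4 constant image with 2 x 2 patches: only the DC channel is non-zero.
image = [[1.0] * 4 for _ in range(4)]
raw_cube = dct_cube(image, p=2)
cube = [frequency_normalize(channel) for channel in raw_cube]
```

Each channel of the resulting cube can then be treated as one frequency-domain view of the slice. A real pipeline would likely use a vectorized DCT routine rather than this quadruple loop, which is written out only to keep the sketch dependency-free.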
