MVEB: Self-Supervised Learning with Multi-View Entropy Bottleneck

Liangjian Wen; Xiasi Wang; Jianzhuang Liu; Zenglin Xu

TL;DR

MVEB sidesteps the intractability of mutual-information-based objectives for learning a minimal sufficient representation. It recasts the problem as jointly maximizing the agreement between the embeddings of two views and the differential entropy of the embedding distribution, with the entropy term handled by a score-based estimator that uses a von Mises–Fisher kernel. The resulting objective is tractable and trains a Siamese network end to end without large negative banks or architectural tricks. With a vanilla ResNet-50 backbone it reaches $76.9\%$ top-1 accuracy on ImageNet linear evaluation, which the authors report as a new state of the art for that backbone. It also generalizes across tasks, including semi-supervised classification, transfer learning to diverse datasets, and detection/segmentation on COCO, which supports the practical value of learning a minimal sufficient representation. Finally, the work relates alignment, uniformity, and information bottlenecks, offering a unified view that connects contrastive, asymmetric, and decorrelation approaches under the MVEB framework.

Abstract

Self-supervised learning aims to learn representations that generalize effectively to downstream tasks. Many self-supervised approaches treat two views of an image as both the input and the self-supervised signal, assuming that each view contains the same task-relevant information and that the shared information is (approximately) sufficient for predicting downstream tasks. Recent studies show that discarding superfluous information not shared between the views can improve generalization. Hence, the ideal representation is sufficient for downstream tasks and contains minimal superfluous information; it is termed the minimal sufficient representation. One can learn this representation by maximizing the mutual information between the representation and the supervised view while eliminating superfluous information. Nevertheless, computing mutual information is notoriously intractable. In this work, we propose an objective termed multi-view entropy bottleneck (MVEB) to learn the minimal sufficient representation effectively. MVEB reduces learning the minimal sufficient representation to maximizing both the agreement between the embeddings of the two views and the differential entropy of the embedding distribution. Our experiments confirm that MVEB significantly improves performance. For example, it achieves a top-1 accuracy of 76.9\% on ImageNet with a vanilla ResNet-50 backbone under linear evaluation. To the best of our knowledge, this is the new state-of-the-art result with ResNet-50.
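The shape of the objective described in the abstract (agreement between view embeddings plus the entropy of the embedding distribution) can be illustrated with a small NumPy sketch. This is not the authors' implementation. In place of the paper's score-based entropy estimator, it swaps in a simpler leave-one-out kernel-density estimate with a von Mises–Fisher-style kernel. The function names, the concentration `kappa`, the default `beta`, and the placement of `beta` on the entropy term are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit sphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def agreement(z1, z2):
    """Mean inner product between paired, already-normalized embeddings."""
    return float(np.mean(np.sum(z1 * z2, axis=1)))

def kde_entropy(z, kappa=5.0):
    """Leave-one-out kernel-density entropy estimate on the unit sphere.

    Uses an unnormalized vMF-style kernel exp(kappa * <z_i, z_j>), so the
    value is only defined up to an additive constant. This is a stand-in
    for, not a reproduction of, a score-based estimator.
    """
    n = z.shape[0]
    logits = kappa * (z @ z.T)
    np.fill_diagonal(logits, -np.inf)           # exclude each point's own kernel
    m = logits.max(axis=1, keepdims=True)       # stabilise the log-sum-exp
    log_density = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1)) - np.log(n - 1)
    return float(-np.mean(log_density))

def mveb_style_objective(z1, z2, beta=0.01, kappa=5.0):
    """Quantity to maximize: agreement + beta * entropy (hypothetical weighting)."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    entropy = 0.5 * (kde_entropy(z1, kappa) + kde_entropy(z2, kappa))
    return agreement(z1, z2) + beta * entropy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spread = rng.normal(size=(64, 16))          # embeddings spread over the sphere
    collapsed = np.ones((64, 16))               # every embedding identical
    print(mveb_style_objective(spread, spread))
    print(mveb_style_objective(collapsed, collapsed))
```

In this sketch, perfectly aligned but collapsed embeddings score lower than aligned embeddings that are spread over the sphere, because the entropy term penalizes concentrating all points in one place. That is the intuition behind pairing agreement with entropy.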

Paper Structure (23 sections, 40 equations, 4 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Preliminary: Minimal Sufficient Representation
Approach
Multi-View Information Bottleneck
Multi-View Entropy Bottleneck
Analysis of the Variational Approximation
Score-Based Entropy Estimation with the von Mises-Fisher Kernel
Rethinking Alignment and Uniformity
Main Results
Pretraining Details
Linear Evaluation on ImageNet
Semi-Supervised Classification on ImageNet
Transfer Learning
Object Detection and Segmentation
...and 8 more sections

Figures (4)

Figure 1: Illustration of the sufficient and minimal representation in the unsupervised multi-view setting. The common assumption in multi-view learning is that the information $I(\mathbf{v_1};\mathbf{v_2})$ shared between view $\mathbf{v_1}$ and view $\mathbf{v_2}$ is sufficient for the prediction of downstream tasks (Sridharan and Kakade, 2008). When $I\left(\mathbf{z_1}; \mathbf{v_2}\right)=I\left(\mathbf{v_1}; \mathbf{v_2}\right)$, $\mathbf{z_1}$ (denoted by the dotted line in the figure) contains all the task-relevant information shared between the two views (left). Hence, $\mathbf{z_1}$ is a sufficient representation. If all superfluous information is eliminated, i.e., $I\left(\mathbf{z_1}; \mathbf{v_1}\right)=I\left(\mathbf{v_1}; \mathbf{v_2}\right)$ (right), $\mathbf{z_1}$ is the minimal sufficient representation.
Figure 2: Framework of MVEB and its training objective.
Figure 3: Visualization of the multi-view information bottleneck model.
Figure 4: Linear classification on ImageNet by MVEB pretrained with different coefficients $\beta$. Collapse means that the accuracy of the linear classification is 0.
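
The two conditions in the Figure 1 caption can be connected through the chain rule of mutual information. The following is a sketch under one assumption that the captions do not state explicitly: $\mathbf{z_1}$ is computed from $\mathbf{v_1}$ alone, so $\mathbf{z_1}$ is conditionally independent of $\mathbf{v_2}$ given $\mathbf{v_1}$.

```latex
\begin{aligned}
I(\mathbf{z_1}; \mathbf{v_1})
  &= I(\mathbf{z_1}; \mathbf{v_1}, \mathbf{v_2})
     && \text{since } I(\mathbf{z_1}; \mathbf{v_2} \mid \mathbf{v_1}) = 0 \\
  &= \underbrace{I(\mathbf{z_1}; \mathbf{v_2})}_{\text{shared (task-relevant)}}
   + \underbrace{I(\mathbf{z_1}; \mathbf{v_1} \mid \mathbf{v_2})}_{\text{superfluous}},
\\[4pt]
I(\mathbf{z_1}; \mathbf{v_2}) &\le I(\mathbf{v_1}; \mathbf{v_2})
     && \text{(data processing inequality)}.
\end{aligned}
```

Under this reading, sufficiency means the inequality holds with equality, and minimality additionally requires $I(\mathbf{z_1}; \mathbf{v_1} \mid \mathbf{v_2}) = 0$. Together these give $I(\mathbf{z_1}; \mathbf{v_1}) = I(\mathbf{v_1}; \mathbf{v_2})$, which is the condition shown in the right panel of Figure 1.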
