Self-Supervised Learning as Discrete Communication

Kawtar Zaher, Ilyass Moummad, Olivier Buisson, Alexis Joly

TL;DR

This work reframes self-supervised learning as a discrete communication task between a teacher and a student, exchanging information through a fixed-capacity binary channel and enforcing multi-label binary agreement via a binary cross-entropy objective. It introduces a coding-rate regularizer applied to the pre-binarization logits to encourage efficient, diverse use of the channel, and periodically randomizes the projection head to boost robustness across encodings. The resulting Binary Information Transmission for Self-Supervision (BITS) framework yields more factorized representations, improves image retrieval and downstream transfer, and reveals a compact, reusable discrete language learned by the binary codes. Overall, discretizing the SSL agreement mechanism, not the embeddings, provides a practical path to structured representations with strong empirical gains across vision tasks and domain shifts.

Abstract

Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
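
The objective described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact formulas, so thresholding teacher logits at zero, the log-det form of the coding rate (borrowed from the rate-reduction literature), applying the regularizer to the student logits, and the names `binarize`, `bce_agreement`, `coding_rate`, `agreement_loss_sketch`, `eps`, and `lam` are all assumptions made for illustration.

```python
import math

def binarize(logits):
    # Assumption: the teacher's binary message is its logits thresholded at zero.
    return [[1.0 if v > 0 else 0.0 for v in row] for row in logits]

def bce_agreement(student_logits, teacher_bits):
    # Element-wise binary cross-entropy on logits, averaged over samples and bits.
    # Uses the stable form max(s, 0) - s * t + log(1 + exp(-|s|)).
    total, count = 0.0, 0
    for s_row, t_row in zip(student_logits, teacher_bits):
        for s, t in zip(s_row, t_row):
            total += max(s, 0.0) - s * t + math.log1p(math.exp(-abs(s)))
            count += 1
    return total / count

def coding_rate(logits, eps=0.5):
    # Assumed form: R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z),
    # where Z is the (n x d) matrix of pre-binarization logits.
    n, d = len(logits), len(logits[0])
    scale = d / (n * eps ** 2)
    # A = I + scale * Z^T Z is symmetric positive definite.
    a = [[scale * sum(row[i] * row[j] for row in logits) + (1.0 if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    # Cholesky factorization A = L L^T, so logdet(A) = 2 * sum(log L_ii).
    chol = [[0.0] * d for _ in range(d)]
    logdet = 0.0
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(s)
                logdet += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return 0.5 * logdet

def agreement_loss_sketch(student_logits, teacher_logits, lam=0.1):
    # Hypothetical combination: BCE agreement minus a weighted coding rate.
    # Which logits the regularizer acts on is an assumption (student logits here).
    bits = binarize(teacher_logits)
    return bce_agreement(student_logits, bits) - lam * coding_rate(student_logits)

if __name__ == "__main__":
    teacher = [[2.0, -1.0], [-0.5, 3.0]]
    student = [[1.5, -0.5], [-1.0, 2.0]]
    print(agreement_loss_sketch(student, teacher))
```

Under this assumed form, the coding rate is larger when the logit dimensions are decorrelated than when they collapse onto one direction, which is one way to read the abstract's "effective utilization of the constrained channel".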

Paper Structure

This paper contains 22 sections, 11 equations, 10 figures, and 14 tables.

Figures (10)

  • Figure 1: Cumulative explained variance of the dimensions of the representations of ImageNet-1k validation set.
  • Figure 2: Visualization of t-SNE embeddings of the hash codes.
  • Figure 3: Visualization of images from two ImageNet classes conditioned on the value of bit 0 in each row. Top row (bit=0) contains humans. Bottom row (bit=1) consistently shows no humans.
  • Figure 4: Pre-training on ImageNet-1k: mAP score evolution across epochs.
  • Figure 5: Linear probing results evolution on Birds525 (left) and PlantNet300k (right).
  • ...and 5 more figures