Information-Maximized Soft Variable Discretization for Self-Supervised Image Representation Learning

Chuang Niu, Wenjun Xia, Hongming Shan, Ge Wang

TL;DR

This work tackles self-supervised image representation learning by directly optimizing information measures over discretized latent variables. It introduces Information-Maximized Soft Variable Discretization (IMSVD), an SSL framework that discretizes each latent variable softly rather than hard and estimates the variables' distributions so that information-based objectives can be computed. A joint-cross entropy loss, supported by the IMSVD theorem, drives the embeddings toward one-hot, transform-invariant, and redundancy-minimized features, and yields discriminative, contrastive-like behavior without negative samples. Empirically, IMSVD achieves competitive or superior accuracy and improved efficiency on ImageNet linear evaluation, KNN classification, and transfer tasks, with interpretable discrete features and robust performance across training settings.

Abstract

Self-supervised learning (SSL) has emerged as a crucial technique in image processing, encoding, and understanding, especially for developing today's vision foundation models that utilize large-scale datasets without annotations to enhance various downstream tasks. This study introduces a novel SSL approach, Information-Maximized Soft Variable Discretization (IMSVD), for image representation learning. Specifically, IMSVD softly discretizes each variable in the latent space, enabling the estimation of their probability distributions over training batches and allowing the learning process to be directly guided by information measures. Motivated by the MultiView assumption, we propose an information-theoretic objective function to learn transform-invariant, non-trivial, and redundancy-minimized representation features. We then derive a joint-cross entropy loss function for self-supervised image representation learning, which is theoretically superior to existing methods in reducing feature redundancy. Notably, although IMSVD is a non-contrastive method, it statistically performs contrastive learning. Extensive experimental results demonstrate the effectiveness of IMSVD on various downstream tasks in terms of both accuracy and efficiency. Thanks to the variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level. IMSVD has the potential to be adapted to other learning paradigms. Our code is publicly available at https://github.com/niuchuangnn/IMSVD.
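
The soft discretization and batch-level distribution estimate described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the embedding is split into equal-sized groups (one per variable), that a softmax over each group gives the soft assignment, and that the distribution of a variable is estimated as the batch mean of those assignments. The names `soft_discretize`, `batch_marginals`, and `entropy` are hypothetical.

```python
import math


def soft_discretize(z, num_vars, num_units):
    """Split a flat embedding into num_vars groups and softmax each group.

    Assumption: softmax is used as the soft (differentiable) discretizer.
    Returns a list of num_vars probability vectors of length num_units.
    """
    assert len(z) == num_vars * num_units
    q = []
    for m in range(num_vars):
        logits = z[m * num_units:(m + 1) * num_units]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        q.append([e / total for e in exps])
    return q


def batch_marginals(batch_q):
    """Estimate p(m, d) as the mean soft assignment over the batch."""
    n = len(batch_q)
    num_vars = len(batch_q[0])
    num_units = len(batch_q[0][0])
    return [
        [sum(q[m][d] for q in batch_q) / n for d in range(num_units)]
        for m in range(num_vars)
    ]


def entropy(p):
    """Shannon entropy in nats, using the convention 0 * log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


# Toy batch: 2 samples, 2 variables, 2 units per variable.
batch = [
    [4.0, 0.0, 0.0, 4.0],
    [0.0, 4.0, 4.0, 0.0],
]
batch_q = [soft_discretize(z, num_vars=2, num_units=2) for z in batch]
p = batch_marginals(batch_q)
print("marginals:", p)
print("marginal entropy of variable 0:", entropy(p[0]))
```

In this toy batch each sample is assigned almost entirely to one unit per variable, while the two units are used equally often across the batch, so the estimated marginals are close to uniform. Quantities of this kind are what an information-based objective would be computed from.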

Paper Structure

This paper contains 25 sections, 1 theorem, 11 equations, 5 figures, 9 tables.

Key Result

Theorem 1

If the cross-joint entropy loss function in Eq. eq_loss is minimized, we have: $\forall i, m, d, {\bm{q}}'_i(m,d) = {\bm{q}}''_i(m,d)$ are one-hot vectors, ${\bm{p}}(m,d)=\frac{1}{D_M}$, $\forall m_1, m_2, d_1, d_2, m_1 \ne m_2, {\bm{P}}(m_1, m_2; d_1, d_2) = {\bm{P}}^c(m_1, m_2; d_1, d_2) = \fra
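
As a quick check on what the first two conditions of the theorem imply, the entropies involved can be worked out directly. This is only an illustrative calculation. It assumes that $D_M$ denotes the number of discrete values per variable and that ${\bm{p}}(m,:)$ is the data-level distribution of variable $m$; neither is spelled out in the statement itself.

$$
H\big({\bm{q}}'_i(m,:)\big) = -\sum_{d=1}^{D_M} {\bm{q}}'_i(m,d)\,\log {\bm{q}}'_i(m,d) = 0 \quad \text{(one-hot, with } 0 \log 0 := 0\text{)}
$$

$$
H\big({\bm{p}}(m,:)\big) = -\sum_{d=1}^{D_M} \frac{1}{D_M}\,\log \frac{1}{D_M} = \log D_M
$$

Under these assumptions, each sample commits to a single value per variable (zero per-sample entropy), while every value is used equally often across the data (the maximal marginal entropy $\log D_M$).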

Figures (5)

  • Figure 1: Illustration of discrete variables for encoding images. (a) The feature vector is statistically optimized to be a set of discrete variables ($v_1$, ..., $v_M$) as shown in different colors. Different variables are associated with diverse attributes; e.g., $v_1, v_2, v_M$ represent object part, texture, and shape, respectively. Each variable $v_m$ is quantized with a set of discrete values represented by a one-hot vector ${\bm{q}}(m, :)$. The example images are selected with the practically optimized vectors in the same way as described in Sec. \ref{sec_vis}. (b) Specific images are encoded with different combinations of the one-hot vectors; e.g., IMG-1 is encoded with $v_1 = [1, 0, \cdots, 0]$ representing the head part, $v_2 = [1, 0, \cdots, 0]$ representing the dots texture, and $v_M = [1, 0, \cdots, 0]$ representing the round shape.
  • Figure 2: SSL framework through IMSVD optimized with the joint entropy loss. For illustration purposes, the embedding feature vector only consists of four variables and each variable is discretized into four units.
  • Figure 3: Visualization of IMSVD. (a) Cross-joint probability matrix, where blue and yellow respectively represent small and large values, and only the first five variables are shown for clear visualization. Note that this is computed on the whole ImageNet training set. (b) Two transformations of the same image. (c) Embedding vectors corresponding to the images in (b), where only the first ten variables are shown for clear visualization. Although we only show a single case for (b) and (c), readers can check more cases using our provided code and models.
  • Figure 4: Visualization of learned IMSVD features on the ImageNet validation set. The left side shows the samples assigned to the features indexed by 6, 8, 24, and 25 of the first variable. The right side shows the samples assigned to the features indexed by 88, 124, 134, and 138 of the second variable. Although we only show several cases, readers can check more cases using our provided code and models.
  • Figure 5: Local feature visualization using Grad-CAM. Here the Grad-CAM maps show the local features of the learned variables, where the same set of images is used as in Fig. \ref{fig:samples}.
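
A cross-joint probability matrix of the kind shown in Figure 3(a) can be illustrated with a small sketch. The estimator below is an assumption, since this summary does not give the exact definition: it takes the joint table between variable `m1` in one view and variable `m2` in the other view to be the batch mean of the outer product of their (soft) one-hot vectors. The names `one_hot` and `cross_joint_table` are hypothetical.

```python
def one_hot(index, size):
    """Return a one-hot list of the given size."""
    return [1.0 if d == index else 0.0 for d in range(size)]


def cross_joint_table(view1, view2, m1, m2):
    """Estimate the joint table between variable m1 (view 1) and m2 (view 2).

    Assumption: table[a][b] is the batch mean of view1[i][m1][a] * view2[i][m2][b].
    """
    n = len(view1)
    size1 = len(view1[0][m1])
    size2 = len(view2[0][m2])
    table = [[0.0] * size2 for _ in range(size1)]
    for q1, q2 in zip(view1, view2):
        for a in range(size1):
            for b in range(size2):
                table[a][b] += q1[m1][a] * q2[m2][b] / n
    return table


# Toy data: 4 samples, 2 variables with 2 values each; both views agree,
# and the two variables take every value combination exactly once.
codes = [(0, 0), (0, 1), (1, 0), (1, 1)]
view = [[one_hot(a, 2), one_hot(b, 2)] for a, b in codes]

same_variable = cross_joint_table(view, view, 0, 0)
different_variables = cross_joint_table(view, view, 0, 1)
print("same variable:", same_variable)
print("different variables:", different_variables)
```

In this toy case the table for a variable paired with itself is diagonal, because the two views agree, while the table for two different variables is flat, because the variables are used independently and uniformly. This mirrors the block structure one would look for in such a matrix, under the stated assumptions.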

Theorems & Definitions (1)

  • Theorem 1: IMSVD Theorem