Separating common from salient patterns with Contrastive Representation Learning

Robin Louiset, Edouard Duchesnay, Antoine Grigis, Pietro Gori

TL;DR

This paper tackles the problem of disentangling the common factors shared by a background and a target dataset from the salient, target-specific factors. It proposes SepCLR, a Contrastive Analysis (CA) method formulated under the InfoMax principle. The objective maximizes three mutual information (MI) terms, I(x;c), I(y;c), and I(y;s), while constraining the salient representation of background samples to an information-less vector and enforcing independence between the two representations, I(c;s)=0. Two encoders map inputs into a common space C and a salient space S. The MI terms to maximize are estimated with alignment and uniformity losses, and a KDE-based joint entropy estimator (k-JEM) is used to maximize H(c,s) without assuming a particular pdf. Experiments on vision and medical imaging datasets show better separation of common and salient patterns than CA-VAE baselines, and the method includes supervised extensions for when attributes are available. The work provides a scalable, InfoMax-driven framework for CA with practical estimators and opens paths toward interpretability via targeted attribute disentanglement.
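To make the entropy-based independence term concrete: since I(c;s) = H(c) + H(s) - H(c,s), raising the joint entropy H(c,s) lowers the MI between the two representations for fixed marginal entropies. The following is a minimal sketch of a kernel density (KDE) entropy estimate in plain Python with no external dependencies. The Gaussian kernel, the fixed `bandwidth`, and the function names `gaussian_kde_entropy` and `joint_entropy` are assumptions made for illustration; the paper's actual k-JEM estimator may differ on each of these points.

```python
import math


def gaussian_kde_entropy(samples, bandwidth=0.5):
    """Resubstitution estimate H ~= -(1/N) * sum_i log p_hat(z_i),
    where p_hat is a Gaussian kernel density estimate (assumed kernel)."""
    n = len(samples)
    d = len(samples[0])
    # Log of the Gaussian normalisation constant in d dimensions.
    log_norm = -0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
    total_log_p = 0.0
    for zi in samples:
        # Log kernel values between z_i and every sample (including itself).
        log_k = [
            -sum((a - b) ** 2 for a, b in zip(zi, zj)) / (2.0 * bandwidth ** 2)
            for zj in samples
        ]
        # Log-sum-exp for numerical stability.
        m = max(log_k)
        log_sum = m + math.log(sum(math.exp(v - m) for v in log_k))
        total_log_p += log_sum - math.log(n) + log_norm
    return -total_log_p / n


def joint_entropy(common, salient, bandwidth=0.5):
    """Estimate H(c, s) by concatenating paired common and salient embeddings."""
    joint = [list(c) + list(s) for c, s in zip(common, salient)]
    return gaussian_kde_entropy(joint, bandwidth)
```

In this sketch, spreading the paired (c, s) embeddings apart increases the estimate, which is the direction a joint-entropy maximization term would push.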

Abstract

Contrastive Analysis is a sub-field of Representation Learning that aims at separating common factors of variation between two datasets, a background (e.g., healthy subjects) and a target (e.g., diseased subjects), from the salient factors of variation, only present in the target dataset. Despite their relevance, current models based on Variational Auto-Encoders have shown poor performance in learning semantically-expressive representations. On the other hand, Contrastive Representation Learning has shown tremendous performance leaps in various applications (classification, clustering, etc.). In this work, we propose to leverage the ability of Contrastive Learning to learn semantically expressive representations well adapted for Contrastive Analysis. We reformulate Contrastive Analysis under the lens of the InfoMax Principle and identify two Mutual Information terms to maximize and one to minimize. We decompose the first two terms into an Alignment and a Uniformity term, as commonly done in Contrastive Learning. Then, we motivate a novel Mutual Information minimization strategy to prevent information leakage between common and salient distributions. We validate our method, called SepCLR, on three visual datasets and three medical datasets, specifically conceived to assess the pattern separation capability in Contrastive Analysis. Code available at https://github.com/neurospin-projects/2024_rlouiset_sep_clr.
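The Alignment and Uniformity decomposition mentioned in the abstract can be illustrated with the generic form widely used in Contrastive Learning: alignment is the mean squared distance between embeddings of two views of the same image, and uniformity is the log of the mean Gaussian potential over pairs of embeddings. This is a sketch of that generic form only, written in plain Python; the temperature `t=2.0` and the function names are assumptions, and the exact losses used by SepCLR may differ.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment_loss(views_a, views_b):
    """Mean squared distance between embeddings of two views of the same image.
    views_a[i] and views_b[i] are assumed to come from the same input."""
    return sum(sq_dist(a, b) for a, b in zip(views_a, views_b)) / len(views_a)


def uniformity_loss(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.
    Lower values mean the embeddings are more spread out."""
    n = len(embeddings)
    potentials = [
        math.exp(-t * sq_dist(embeddings[i], embeddings[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return math.log(sum(potentials) / len(potentials))


# Four unit vectors spread over the circle versus four copies of one vector.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0]] * 4
```

Minimizing the first term pulls views of the same image together, while minimizing the second repels embeddings of different images; `spread` scores lower than `clustered` on the uniformity term.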

Paper Structure (44 sections, 47 equations, 10 figures, 21 tables)


Figures (10)

  • Figure 1: SepCLR is trained to identify and separate the salient patterns (color variations) of the target dataset $Y$ from the common patterns (shape) shared between the background dataset $X$ and the target dataset $Y$. Views (transformations $t(\cdot)$) of both datasets are fed to two different encoders, one for the salient space ($f_{\theta_s}$) and one for the common space ($f_{\theta_c}$). In the hyperspherical common space $C$, embeddings of views of the same image (from both $X$ and $Y$) are aligned, while embeddings from different images are repelled ($\max\, I(c;x) + I(c;y)$). This forces $C$ to represent the shared patterns (shape). In the salient space $S$, which is a Euclidean space, background embeddings are aligned onto an information-less null vector $s'$ ($D_{KL}(s_x || \delta(s'))=0$) so that $S$ does not capture background variability (i.e., shape). Furthermore, embeddings of views of the same image (only from $Y$) are aligned, embeddings from different images are pushed away from each other, and all of them are repelled from $s'$ ($\max\, I(s;y)$). This forces $S$ to capture only the salient patterns of $Y$ (color). To limit information leakage between $C$ and $S$, their MI is constrained to be null, i.e., $I(c;s)=0$.
  • Figure 2: Qualitative results on attribute-supervised SepCLR
  • Figure 3: Illustration of the dSprites dataset and its different independent variability factors: shape, zoom, rotation, Y position, and X position.
  • Figure 4: The dataset of MNIST digits superimposed on CIFAR-10 backgrounds. Target images are CIFAR-10 images overlaid with an MNIST digit. Background images are CIFAR-10 images.
  • Figure 5: CelebA accessories dataset. The upper row consists of background images. The lower row shows target images.
  • ...and 5 more figures
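
The salient-space behaviour described in the Figure 1 caption can be sketched as two small loss terms: one that pulls background embeddings onto the null vector $s'$, and one that aligns views of the same target image while repelling target embeddings from each other and from $s'$. This is an illustrative sketch in plain Python, not the authors' implementation. The function names, the Gaussian-potential form of the repulsion, the temperature `t`, and the unweighted sum of the two target terms are all assumptions.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def background_null_loss(bg_salient, null_vec):
    """Pull every background salient embedding onto the null vector s'.
    The loss is zero only when all background embeddings equal s'."""
    return sum(sq_dist(s, null_vec) for s in bg_salient) / len(bg_salient)


def target_salient_loss(views_a, views_b, null_vec, t=2.0):
    """Align two views of each target image, and repel target embeddings
    from each other and from s' (assumed Gaussian-potential repulsion)."""
    align = sum(sq_dist(a, b) for a, b in zip(views_a, views_b)) / len(views_a)
    # Treat s' as one extra point so that targets are pushed away from it too.
    points = list(views_a) + [null_vec]
    n = len(points)
    potentials = [
        math.exp(-t * sq_dist(points[i], points[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]
    repel = math.log(sum(potentials) / len(potentials))
    return align + repel
```

Under these assumptions, target embeddings that sit far from $s'$ and from each other yield a lower loss than embeddings crowded around $s'$, while background embeddings are penalized for any deviation from $s'$.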