Non-negative Contrastive Learning

Yifei Wang, Qi Zhang, Yaoyu Guo, Yisen Wang

TL;DR

Non-negative Contrastive Learning (NCL) imposes non-negativity on contrastive features to yield interpretable, sparse, and disentangled representations. It is shown to be theoretically equivalent to a non-negative matrix factorization objective and accompanied by identifiability and downstream-generalization guarantees, with a simple reparameterization (e.g., ReLU) that preserves CL performance. Empirically, NCL improves feature disentanglement, enables effective feature selection, and enhances downstream classification, including out-of-distribution robustness, while naturally extending to supervised and multi-modal settings via Non-negative Cross Entropy (NCE) and MMNCL. Overall, NCL offers a principled, scalable path to interpretable representations and broad applicability across SSL, supervised, and multi-modal learning.

Abstract

Deep representations have shown promising performance when transferred to downstream tasks in a black-box manner. Yet, their inherent lack of interpretability remains a significant challenge, as these features are often opaque to human understanding. In this paper, we propose Non-negative Contrastive Learning (NCL), a renaissance of Non-negative Matrix Factorization (NMF) aimed at deriving interpretable features. The power of NCL lies in its enforcement of non-negativity constraints on features, reminiscent of NMF's capability to extract features that align closely with sample clusters. NCL not only aligns well mathematically with an NMF objective but also preserves NMF's interpretability attributes, resulting in a sparser and more disentangled representation than standard contrastive learning (CL). Theoretically, we establish guarantees on the identifiability and downstream generalization of NCL. Empirically, we show that these advantages enable NCL to outperform CL significantly on feature disentanglement, feature selection, and downstream classification tasks. Finally, we show that NCL extends easily to other learning scenarios and also benefits supervised learning. Code is available at https://github.com/PKU-ML/non_neg.
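
The non-negativity constraint described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear encoder `W`, the InfoNCE loss, the temperature of 0.5, and the toy data are all assumptions chosen for illustration. The only idea taken from the text is that features are passed through a ReLU before entering the contrastive loss.

```python
import numpy as np

def relu(z):
    # Reparameterization that makes every feature non-negative.
    return np.maximum(z, 0.0)

def info_nce(a, b, temperature=0.5):
    """InfoNCE loss (an assumed choice of contrastive loss).

    Row i of `a` and row i of `b` are treated as a positive pair;
    all other rows of `b` act as negatives for row i of `a`.
    """
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))            # hypothetical linear encoder
x = rng.normal(size=(8, 32))             # toy batch of 8 inputs
x1 = x + 0.1 * rng.normal(size=x.shape)  # first augmented view (toy noise)
x2 = x + 0.1 * rng.normal(size=x.shape)  # second augmented view (toy noise)

z1, z2 = x1 @ W, x2 @ W                  # unconstrained (CL-style) features
f1, f2 = relu(z1), relu(z2)              # non-negative (NCL-style) features

cl_loss = info_nce(z1, z2)
ncl_loss = info_nce(f1, f2)
zero_fraction = float(np.mean(f1 == 0))  # share of exactly-zero entries
```

In this sketch the only difference between the two losses is the ReLU applied to the encoder output, which also produces exact zeros in the features; whether a trained model ends up sparse and disentangled is an empirical question that this toy example does not answer.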

Paper Structure (42 sections, 8 theorems, 22 equations, 10 figures, 7 tables)

Key Result

Theorem 1

As long as the unconstrained objective ${\mathcal{L}}$ only relies on pairwise Euclidean similarity (or distance), e.g., $f(x)^\top f(x')$, its solution $f^*(x)$ suffers from rotation symmetry.
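
The rotation symmetry in Theorem 1 can be checked numerically. The sketch below uses hand-picked 2-D toy features and a rotation by $\pi/3$ (both assumptions for illustration): rotating all features by the same matrix leaves every pairwise inner product unchanged, so an objective that depends only on those inner products cannot distinguish $f$ from its rotated version. The same rotation pushes some coordinates below zero, which gives one intuition for why a non-negativity constraint removes this freedom.

```python
import numpy as np

def rotation(theta):
    # 2-D rotation matrix; any orthogonal matrix would serve the same purpose.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Toy non-negative features, one row per sample (illustrative values).
F = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0]])

R = rotation(np.pi / 3)
G = F @ R  # rotated features

# Pairwise similarities f(x)^T f(x') are identical before and after rotation.
same_similarity = bool(np.allclose(F @ F.T, G @ G.T))

# The rotated features are no longer non-negative.
rotated_is_nonneg = bool((G >= 0).all())

print(same_similarity, rotated_is_nonneg)  # prints: True False
```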

Figures (10)

  • Figure 1: Feature visualization of semantic consistency (a-b) and sparsity (c-d) on CIFAR-10. Panels (a-b) show the top-activated samples along each feature dimension: those of CL (a) often mix distinct semantics within a dimension (column), e.g., deer and airplanes, while those of NCL (b) are much more semantically consistent, indicating better feature disentanglement. Comparing (c) and (d), NCL features are much sparser than CL features, with only a few activated dimensions ($<10\%$) per sample.
  • Figure 2: Relationship between different learning paradigms discussed in this work.
  • Figure 3: Comparisons between contrastive learning (CL) and non-negative contrastive learning (NCL): a) class consistency rate, measuring the proportion of activated samples that belong to their most frequent class along each feature dimension; b) feature sparsity, the average proportion of zero elements ($|x|<10^{-5}$) in the features of each test sample; c) dimensional correlation matrix $C$ of 20 random features: $\forall (i,j),C_{ij}=\mathbb{E}_x\tilde{f}_i(x)^\top\tilde{f}_j(x)$, where $\tilde{f}_i(x)=f_i(x)/\sqrt{\sum_x \left(f_i(x)\right)^2}$.
  • Figure 4: Training from scratch with CE and NCE (w/o projector) on ImageNet-100.
  • Figure 5: Comparison between spectral contrastive learning (SCL) and non-negative contrastive learning (NCL) on CIFAR-10: a) linear probing accuracy; b) class consistency rate, measuring the proportion of activated samples that belong to their most frequent class along each feature dimension; c) feature sparsity, the average proportion of zero elements ($|x|<10^{-5}$) in the features of each test sample; d) dimensional correlation matrix $C$ of 20 random features: $\forall (i,j),C_{ij}=\mathbb{E}_x\tilde{f}_i(x)^\top\tilde{f}_j(x)$, where $\tilde{f}_i(x)=f_i(x)/\sqrt{\sum_x \left(f_i(x)\right)^2}$.
  • ...and 5 more figures
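
The three metrics named in the captions of Figures 3 and 5 can be sketched as follows. This is an illustrative reading of the captions, not the authors' evaluation code. Three points are assumptions: a sample counts as "activated" along a dimension when its absolute feature value is at least $10^{-5}$, dimensions with no activated sample are skipped, and the correlation matrix uses a sum over samples rather than a mean, so that its diagonal equals 1. The function names and toy data are hypothetical.

```python
import numpy as np

EPS = 1e-5  # threshold below which an entry counts as zero (from the captions)

def feature_sparsity(F, eps=EPS):
    # Average, over samples, of the proportion of (near-)zero entries.
    return float(np.mean(np.mean(np.abs(F) < eps, axis=1)))

def class_consistency_rate(F, labels, eps=EPS):
    # For each dimension: share of activated samples in their most frequent class.
    rates = []
    for i in range(F.shape[1]):
        active = np.abs(F[:, i]) >= eps   # assumed definition of "activated"
        if not active.any():
            continue                      # assumption: skip dead dimensions
        counts = np.bincount(labels[active])
        rates.append(counts.max() / counts.sum())
    return float(np.mean(rates)) if rates else float("nan")

def dimensional_correlation(F):
    # Normalize each feature dimension over samples, then take inner products.
    norms = np.sqrt((F ** 2).sum(axis=0, keepdims=True))
    F_tilde = F / np.maximum(norms, 1e-12)
    return F_tilde.T @ F_tilde            # assumption: sum over x, not mean

# Toy features: 4 samples (rows), 3 dimensions (columns), with integer labels.
F = np.array([[1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 3.0, 0.0],
              [0.0, 1.0, 1.0]])
labels = np.array([0, 0, 1, 0])

sparsity = feature_sparsity(F)                     # 7/12
consistency = class_consistency_rate(F, labels)    # (1 + 0.5 + 1) / 3
C = dimensional_correlation(F)
```

On this toy input, dimensions 0 and 1 never fire on the same sample, so `C[0, 1]` is 0, while dimensions 1 and 2 share one activated sample and give `C[1, 2] = 1/\sqrt{10}`. A near-diagonal `C` corresponds to weakly correlated feature dimensions.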

Theorems & Definitions (16)

  • Theorem 1
  • proof
  • Theorem 2
  • Theorem 3
  • Theorem 4: Optimal representations under one-hot latent labels
  • Definition 1: Feature Identifiability
  • Theorem 5
  • Theorem 6
  • proof
  • proof
  • ...and 6 more