Non-negative Contrastive Learning

Yifei Wang; Qi Zhang; Yaoyu Guo; Yisen Wang

TL;DR

Non-negative Contrastive Learning (NCL) imposes non-negativity on contrastive features to yield interpretable, sparse, and disentangled representations. It is shown to be theoretically equivalent to a non-negative matrix factorization objective and accompanied by identifiability and downstream-generalization guarantees, with a simple reparameterization (e.g., ReLU) that preserves CL performance. Empirically, NCL improves feature disentanglement, enables effective feature selection, and enhances downstream classification, including out-of-distribution robustness, while naturally extending to supervised and multi-modal settings via Non-negative Cross Entropy (NCE) and MMNCL. Overall, NCL offers a principled, scalable path to interpretable representations and broad applicability across SSL, supervised, and multi-modal learning.

Abstract

Deep representations have shown promising performance when transferred to downstream tasks in a black-box manner. Yet, their inherent lack of interpretability remains a significant challenge, as these features are often opaque to human understanding. In this paper, we propose Non-negative Contrastive Learning (NCL), a renaissance of Non-negative Matrix Factorization (NMF) aimed at deriving interpretable features. The power of NCL lies in its enforcement of non-negativity constraints on features, reminiscent of NMF's capability to extract features that align closely with sample clusters. NCL not only aligns well mathematically with an NMF objective but also preserves NMF's interpretability attributes, resulting in a sparser and more disentangled representation than standard contrastive learning (CL). Theoretically, we establish guarantees on the identifiability and downstream generalization of NCL. Empirically, we show that these advantages enable NCL to outperform CL significantly on feature disentanglement, feature selection, and downstream classification tasks. Finally, we show that NCL can be easily extended to other learning scenarios and benefits supervised learning as well. Code is available at https://github.com/PKU-ML/non_neg.
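The non-negativity constraint described above can be imposed, according to the TL;DR, by a simple reparameterization such as ReLU. The following is a minimal sketch of that idea, not the authors' implementation: the toy feature values, the helper names (`relu`, `dot`, `info_nce`), the temperature, and the choice of an InfoNCE-style loss are all assumptions made for illustration.

```python
import math

def relu(v):
    """Element-wise ReLU: maps every feature to a non-negative value."""
    return [max(0.0, x) for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives, temperature=0.5):
    """InfoNCE-style loss over a batch (an illustrative choice of loss).

    positives[i] is the positive view of anchors[i]; the other rows of
    `positives` act as negatives for that anchor.
    """
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Hypothetical encoder outputs for two augmented views of three samples.
raw_view_a = [[1.2, -0.7, 0.1], [-0.3, 0.9, -1.1], [0.4, -0.2, 1.5]]
raw_view_b = [[1.0, -0.5, -0.2], [-0.1, 1.1, -0.8], [0.2, -0.4, 1.3]]

# Non-negative reparameterization: apply ReLU to the encoder outputs.
nonneg_view_a = [relu(v) for v in raw_view_a]
nonneg_view_b = [relu(v) for v in raw_view_b]

cl_loss = info_nce(raw_view_a, raw_view_b)         # unconstrained features
ncl_loss = info_nce(nonneg_view_a, nonneg_view_b)  # non-negative features

print("CL-style loss:", cl_loss)
print("NCL-style loss:", ncl_loss)
```

The only change between the two loss computations is the ReLU applied to the features, which is what makes the reparameterization cheap to add to an existing contrastive pipeline.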

Paper Structure (42 sections, 8 theorems, 22 equations, 10 figures, 7 tables)

Introduction
Background on Contrastive Learning
Limitations in Representation Symmetry
Non-negative Contrastive Learning
Benefits of Non-negativity: Consistency, Sparsity, and Orthogonality
Theoretical Properties of Non-negative Contrastive Learning
Assumptions
Optimal Representations
Feature Identifiability
Downstream Generalization
Applications
Feature Selection
Feature Disentanglement
Downstream Generalization
Extension to Broader Scenarios
...and 27 more sections

Key Result

Theorem 1

As long as the unconstrained objective ${\mathcal{L}}$ only relies on pairwise Euclidean similarity (or distance), e.g., $f(x)^\top f(x')$, its solution $f^*(x)$ suffers from rotation symmetry.
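The invariance behind Theorem 1 can be checked numerically. The sketch below is a toy illustration rather than the paper's proof: the 2-D feature values and the rotation angle are assumptions. It rotates every feature vector by the same angle and compares all pairwise inner products before and after.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rotate(v, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def gram(feats):
    """All pairwise inner products f(x)^T f(x')."""
    return [[dot(u, v) for v in feats] for u in feats]

# Hypothetical non-negative features of three samples.
features = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
theta = math.pi / 5
rotated = [rotate(v, theta) for v in features]

g_before, g_after = gram(features), gram(rotated)
max_gap = max(
    abs(a - b)
    for row_before, row_after in zip(g_before, g_after)
    for a, b in zip(row_before, row_after)
)
has_negative = any(x < 0 for v in rotated for x in v)

print("largest change in any inner product:", max_gap)
print("rotated features contain negative entries:", has_negative)
```

Because the inner products are unchanged up to floating-point error, any objective built only from them assigns the same value to the original and rotated features. In this toy example the rotated features also leave the non-negative orthant, which hints at why a non-negativity constraint rules out such rotations.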

Figures (10)

Figure 1: Feature visualization of semantic consistency (a-b) and sparsity (c-d) on CIFAR-10. Panels (a-b) show the top-activated samples along each feature dimension: those of CL (a) often mix distinct semantics within a dimension (column) (e.g., deer and airplanes), while those of NCL (b) are much more semantically consistent, indicating better feature disentanglement. Comparing (c) and (d), NCL features are much sparser than CL features, with only a few activated dimensions ($<10\%$) per sample.
Figure 2: Relationship between different learning paradigms discussed in this work.
Figure 3: Comparison between contrastive learning (CL) and non-negative contrastive learning (NCL): a) class consistency rate, measuring the proportion of activated samples that belong to their most frequent class along each feature dimension; b) feature sparsity, the average proportion of zero elements ($|x|<10^{-5}$) in the features of each test sample; c) dimensional correlation matrix $C$ of 20 random features: $\forall (i,j),C_{ij}=\mathbb{E}_x\tilde{f}_i(x)^\top\tilde{f}_j(x)$, where $\tilde{f}_i(x)=f_i(x)/\sqrt{\sum_x \left(f_i(x)\right)^2}$.
Figure 4: Training from scratch with CE and NCE (w/o projector) on ImageNet-100.
Figure 5: Comparison between spectral contrastive learning (SCL) and non-negative contrastive learning (NCL) on CIFAR-10: a) linear probing accuracy; b) class consistency rate, measuring the proportion of activated samples that belong to their most frequent class along each feature dimension; c) feature sparsity, the average proportion of zero elements ($|x|<10^{-5}$) in the features of each test sample; d) dimensional correlation matrix $C$ of 20 random features: $\forall (i,j),C_{ij}=\mathbb{E}_x\tilde{f}_i(x)^\top\tilde{f}_j(x)$, where $\tilde{f}_i(x)=f_i(x)/\sqrt{\sum_x \left(f_i(x)\right)^2}$.
...and 5 more figures
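
Two of the metrics named in the captions above, feature sparsity and class consistency rate, are simple enough to sketch in a few lines. This is one plausible reading of the caption definitions, not the authors' evaluation code: the toy features and labels are made up, a sample is assumed to be "activated" on a dimension when its absolute value is at least the caption's zero threshold, and averaging the per-dimension rates is also an assumption.

```python
from collections import Counter

ZERO_THRESHOLD = 1e-5  # values below this magnitude count as zero (from the captions)

def feature_sparsity(features):
    """Average, over samples, of the fraction of near-zero feature entries."""
    per_sample = [
        sum(1 for x in row if abs(x) < ZERO_THRESHOLD) / len(row)
        for row in features
    ]
    return sum(per_sample) / len(per_sample)

def class_consistency_rate(features, labels):
    """Per dimension: share of activated samples in the most frequent class.

    Assumptions: "activated" means |value| >= ZERO_THRESHOLD, dimensions with
    no activated sample are skipped, and the per-dimension rates are averaged.
    """
    rates = []
    for dim in range(len(features[0])):
        active_labels = [
            y for row, y in zip(features, labels)
            if abs(row[dim]) >= ZERO_THRESHOLD
        ]
        if not active_labels:
            continue
        top_count = Counter(active_labels).most_common(1)[0][1]
        rates.append(top_count / len(active_labels))
    return sum(rates) / len(rates)

# Hypothetical non-negative features (rows: samples, columns: dimensions).
features = [
    [0.9, 0.0, 0.0],
    [0.7, 0.0, 0.2],
    [0.0, 1.1, 0.0],
    [0.1, 0.8, 0.0],
]
labels = ["deer", "deer", "airplane", "airplane"]

sparsity = feature_sparsity(features)
consistency = class_consistency_rate(features, labels)
print("feature sparsity:", sparsity)
print("class consistency rate:", consistency)
```

In this toy data the first dimension fires for two deer samples and one airplane sample, so it lowers the consistency rate, while the other two dimensions each respond to a single class.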

Theorems & Definitions (16)

Theorem 1
proof
Theorem 2
Theorem 3
Theorem 4: Optimal representations under one-hot latent labels
Definition 1: Feature Identifiability
Theorem 5
Theorem 6
proof
proof
...and 6 more
