On Discriminative Probabilistic Modeling for Self-Supervised Representation Learning

Bokun Wang, Yunwen Lei, Yiming Ying, Tianbao Yang

TL;DR

The paper addresses self-supervised representation learning when targets lie in a continuous domain by formulating discriminative probabilistic modeling with a continuous conditional density $p_{\mathbf{w}}(\mathbf{y}|\mathbf{x})$. It uses multiple importance sampling (MIS) to approximate the intractable partition function and shows that InfoNCE-based losses are recovered as a special case. To improve generalization, it introduces a non-parametric convex optimization method for estimating the popularity measure $\mathbf{q}$, which yields a new contrastive objective. The proposed NUCLR algorithm optimizes this objective with an alternating scheme and margin-aware negatives, and it outperforms the baselines overall on image-language retrieval and classification tasks after pretraining on CC3M and CC12M. The work provides theoretical generalization insights and practical improvements for discriminative SSL in multimodal, continuous settings, with potential extensions to generative components and reduced memory overhead.
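To make the MIS step concrete, the following is a minimal sketch of Monte Carlo integration with the balance heuristic on a toy one-dimensional problem. It is not the paper's code: the Gaussian proposals, the integrand, and the names `normal_pdf`, `mis_balance_estimate`, and `unnormalized_density` are assumptions chosen for illustration. One sample is drawn from each proposal, and each sample is weighted by the inverse of the sum of all proposal densities at that point, which is the quantity that plays the role of $\mathbf{q}$ in the summary above.

```python
import math
import random


def normal_pdf(y, mean, std=1.0):
    """Density of N(mean, std^2) evaluated at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mis_balance_estimate(f, means, rng):
    """Estimate the integral of f with one sample per proposal N(mean, 1).

    Balance heuristic: each sample y_j is weighted by
    1 / sum_k p_k(y_j), the sum of all proposal densities at y_j.
    """
    total = 0.0
    for mean in means:
        y = rng.gauss(mean, 1.0)
        density_sum = sum(normal_pdf(y, m) for m in means)
        total += f(y) / density_sum
    return total


def unnormalized_density(y):
    """Toy integrand whose integral over the real line is sqrt(2 * pi)."""
    return math.exp(-0.5 * y * y)


rng = random.Random(0)
# Hypothetical proposals: 500 unit-variance Gaussians with means spread over [-2, 2].
means = [-2.0 + 4.0 * k / 499 for k in range(500)]
estimate = mis_balance_estimate(unnormalized_density, means, rng)
exact = math.sqrt(2.0 * math.pi)
print(f"MIS estimate: {estimate:.4f}, exact: {exact:.4f}")
```

In this toy setting the proposal densities are known in closed form. In representation learning they are not, which is why the sum of conditional densities has to be approximated.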

Abstract

We study discriminative probabilistic modeling on a continuous domain for the data prediction task of (multimodal) self-supervised representation learning. To address the challenge of computing the integral in the partition function for each anchor data point, we leverage the multiple importance sampling (MIS) technique for robust Monte Carlo integration, which recovers the InfoNCE-based contrastive loss as a special case. Within this probabilistic modeling framework, we conduct a generalization error analysis to reveal the limitation of the current InfoNCE-based contrastive loss for self-supervised representation learning and derive insights for developing better approaches by reducing the error of Monte Carlo integration. To this end, we propose a novel non-parametric method for approximating the sum of conditional probability densities required by MIS through convex optimization, yielding a new contrastive objective for self-supervised representation learning. Moreover, we design an efficient algorithm for solving the proposed objective. We empirically compare our algorithm to representative baselines on the contrastive image-language pretraining task. Experimental results on the CC3M and CC12M datasets demonstrate the superior overall performance of our algorithm. Our code is available at https://github.com/bokun-wang/NUCLR.
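The claim that InfoNCE arises as a special case can be checked numerically. The sketch below assumes the balance-heuristic form of the estimated partition function, $\sum_j \exp(E(\mathbf{x}_i,\mathbf{y}_j)/\tau)/q^{(j)}$, and plugs it into the negative log-likelihood. The function names, the temperature `tau`, and the random similarity matrix are illustrative assumptions rather than the authors' implementation. When every entry of `q` is the same constant, the loss equals InfoNCE up to an additive constant.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mis_contrastive_loss(sim, q, tau=0.1):
    """Negative log-likelihood with an MIS-style partition-function estimate.

    sim[i][j] stands for the score E(x_i, y_j); q[j] is an assumed
    approximation of the sum of conditional densities at y_j.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [sim[i][j] / tau - math.log(q[j]) for j in range(n)]
        total += log_sum_exp(logits) - sim[i][i] / tau
    return total / n


def infonce_loss(sim, tau=0.1):
    """Standard InfoNCE loss over the rows of sim, positives on the diagonal."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / tau for s in sim[i]]
        total += log_sum_exp(logits) - logits[i]
    return total / n


random.seed(0)
n = 4
sim = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

uniform_q = mis_contrastive_loss(sim, [1.0] * n)
scaled_q = mis_contrastive_loss(sim, [2.0] * n)
baseline = infonce_loss(sim)
print(uniform_q - baseline)  # difference between the two losses for q = 1
print(scaled_q - baseline)   # constant shift of -log(2) for q = 2
```

A non-uniform `q` down-weights negatives whose targets are popular (large `q[j]`), which is the intuition behind learning $\tilde{\mathbf{q}}$ instead of fixing it to a constant.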

Paper Structure

This paper contains 39 sections, 7 theorems, 67 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Suppose that Assumptions asm:bounded and asm:nn hold. Consider the prediction function $E_\mathbf{w}$ parameterized by $L$-layer deep neural networks $\mathbf{w}_1,\mathbf{w}_2$ and an approximation $\tilde{\mathbf{q}}$ of $\mathbf{q}$, where $q^{(j)} = \sum_{j'=1}^n p(\mathbf{y}_j\mid \mathbf{x}_{j'})$ [...] where $\mathcal{E}_\mathbf{w}(\tilde{\mathbf{q}},\mathbf{q};\hat{\mathbf{S}})\coloneqq \frac{1}{n}\ldots$

Figures (8)

  • Figure 1: DPM for supervised learning and self-supervised representation learning.
  • Figure 2: Left: Illustration of spaces $\mathcal{X}$ and $\mathcal{Y}$; Middle: RBF interpolated heatmaps of the true $\mathbf{q}$ and approximation $\tilde{\mathbf{q}}$ when $n=100$; Right: Comparing the generalization error $|\hat{\mathcal{L}}(\tilde{\mathbf{q}},\hat{\mathbf{S}}) - \mathcal{L}|$ of our method and GCL across various $n$. "MLE" refers to the MLE objective in \ref{eq:mle} with the exact partition function.
  • Figure 3: Validation Recall@1 performance of our algorithm and baseline SogCLR during training on the CC3M (left two columns) and CC12M datasets (right two columns).
  • Figure 4: Comparison of the NUCLR algorithm with its variants NUCLR-$\dagger$, NUCLR-$\diamondsuit$, and NUCLR-$\clubsuit$. "Downstream Retrieval" refers to the average test recall@1 on the MSCOCO and Flickr30k datasets; "Downstream Classification" refers to the average test top-1 accuracy on the CIFAR100 and ImageNet1k datasets.
  • Figure 5: Examples of CC3M images with large (in red) and small learned popularities $\tilde{q}'$ (in blue).
  • ...and 3 more figures

Theorems & Definitions (14)

  • Remark 1
  • Remark 2
  • Remark 3
  • Theorem 1
  • Remark 4
  • Theorem 2
  • Remark 5
  • Proposition 1
  • Proof
  • Lemma 1: Contraction Lemma, Thm 11.6 in Boucheron et al. (2013)
  • ...and 4 more