Towards noise contrastive estimation with soft targets for conditional models

Johannes Hugger; Virginie Uhlmann

TL;DR

This work tackles the mismatch between standard cross-entropy and soft-target training by introducing SoftTargetInfoNCE, a loss that integrates probabilistic targets into a noise-contrastive estimation framework. The authors provide a theoretically grounded derivation, practical approximations that enable scalable training, and analyses of the resulting gradient dynamics and of the effect on the mutual information (MI) lower bound. Empirical results on ImageNet, Tiny ImageNet, CIFAR-100, and a graph dataset show that SoftTargetInfoNCE is competitive with soft-target cross-entropy and often outperforms hard-target baselines and vanilla InfoNCE, while offering improved calibration in some cases. The approach extends InfoNCE to supervised classification with label uncertainty and soft regularization, and comes with a straightforward implementation.

Abstract

Soft targets combined with the cross-entropy loss have been shown to improve the generalization performance of deep neural networks on supervised classification tasks. The standard cross-entropy loss, however, assumes that the data are categorically distributed, which is often not the case in practice. In contrast, InfoNCE does not rely on such an explicit assumption but instead implicitly estimates the true conditional through negative sampling. Unfortunately, it cannot be combined with soft targets in its standard formulation, which hinders its use with sophisticated training strategies. In this paper, we address this limitation by proposing a loss function that is compatible with probabilistic targets. Our new soft target InfoNCE loss is conceptually simple, efficient to compute, and can be motivated through the framework of noise contrastive estimation. Using a toy example, we demonstrate shortcomings of the categorical distribution assumption of cross-entropy and discuss implications of sampling from soft distributions. We observe that soft target InfoNCE performs on par with strong soft target cross-entropy baselines and outperforms hard target NLL and InfoNCE losses on popular benchmarks, including ImageNet. Finally, we provide a simple implementation of our loss, geared towards supervised classification and fully compatible with deep classification models trained with cross-entropy.
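To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of cross-entropy with soft targets, using label smoothing as one common way to produce them. It is not the paper's soft target InfoNCE loss and not the authors' code; the function names, the smoothing convention (mixing a one-hot vector with the uniform distribution), and the example logits are all assumptions made for illustration.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """Hypothetical helper: mix a one-hot target with the uniform distribution."""
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[label] += 1.0 - epsilon
    return target


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def soft_target_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(q * lp for q, lp in zip(target, log_probs))


# Illustrative values only.
logits = [2.0, 0.5, -1.0]
hard_target = smooth_labels(0, 3, epsilon=0.0)  # one-hot target
soft_target = smooth_labels(0, 3, epsilon=0.1)  # smoothed target

hard_loss = soft_target_cross_entropy(logits, hard_target)
soft_loss = soft_target_cross_entropy(logits, soft_target)
print(hard_loss, soft_loss)
```

With `epsilon=0.0` the soft-target loss reduces to the usual hard-target negative log-likelihood, which is why the two can share one implementation.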

Paper Structure (37 sections, 37 equations, 6 figures, 3 tables)

Introduction
Background and related work
Log loss, soft targets and the continuous categorical distribution
InfoNCE for supervised classification
Results
InfoNCE improves parameter estimation over negative log-likelihood in conditional models
InfoNCE with soft targets
Derivation.
Computational considerations.
Discussion.
InfoNCE with soft distributions
Effect of soft distributions on the MI lower bound.
Experiments
Loss implementation.
Classification accuracy
...and 22 more sections

Figures (6)

Figure 1: Parameter estimation quality of InfoNCE vs. negative log-likelihood in conditional density models with different degrees of mode alignment. The estimation error is reported as the KL divergence. The alignment degree corresponds to the angle between modes. $0\%$ alignment corresponds to orthogonal modes and $80\%$ corresponds to an angle of $18^\circ$.
Figure 2: Soft target InfoNCE PyTorch code. For clarity, this assumes training on one GPU. In the distributed case, the above has to be slightly adapted to gather (soft) targets from other GPUs to be used as additional negatives. For more details see the code repository.
Figure 3: Noise sample (left) and label smoothing (right) ablation experiments on Tiny ImageNet.
Figure 4: Reliability diagrams of ViT-B/16 trained on Tiny ImageNet (left) and CIFAR-100 (right).
Figure 5: Different degrees of mode alignment of datasets $X$, visualized using t-SNE. The alignment degree corresponds to the angle between modes. $0\%$ alignment corresponds to orthogonal modes and $80\%$ corresponds to an angle of $18^\circ$.
...and 1 more figure
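The captions of Figures 1 and 5 give two reference points for the alignment degree: $0\%$ means orthogonal modes (an angle of $90^\circ$) and $80\%$ means an angle of $18^\circ$. Both points are consistent with a linear map from alignment to angle. The sketch below assumes that linear relation holds for all values; the captions do not state this, and the function name is hypothetical.

```python
def alignment_to_angle_deg(alignment_percent):
    """Assumed linear map: 0% alignment -> 90 degrees, 100% -> 0 degrees."""
    if not 0.0 <= alignment_percent <= 100.0:
        raise ValueError("alignment_percent must lie in [0, 100]")
    return 90.0 * (100.0 - alignment_percent) / 100.0


# The two reference points given in the figure captions.
angle_orthogonal = alignment_to_angle_deg(0.0)
angle_aligned = alignment_to_angle_deg(80.0)
print(angle_orthogonal, angle_aligned)
```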
