When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning

Joshua Steier

TL;DR

Subtracting the margin after the log-probability, rather than clamping the similarity, is proved to be gradient-neutral under the mean-over-positives reduction. In moderate-accuracy regimes with many same-class pairs per batch, switching to this formulation removes the variance inflation associated with clamping.

Abstract

Contrastive Forward-Forward (CFF) learning trains Vision Transformers layer by layer against supervised contrastive objectives. CFF training can be sensitive to random seed, but the sources of this instability are poorly understood. We focus on one implementation detail: the positive-pair margin in the contrastive loss is applied through saturating similarity clamping, $\min(s + m,\, 1)$. We prove that an alternative formulation, subtracting the margin after the log-probability, is gradient-neutral under the mean-over-positives reduction. On CIFAR-10 ($2 \times 2$ factorial, $n{=}7$ seeds per cell), clamping produces $5.90\times$ higher pooled test-accuracy variance ($p{=}0.003$) with no difference in mean accuracy. Analyses of clamp activation rates, layerwise gradient norms, and a reduced-margin probe point to saturation-driven gradient truncation at early layers. The effect does not transfer cleanly to other datasets: on CIFAR-100, SVHN, and Fashion-MNIST, clamping produces equal or lower variance. Two factors account for the discrepancy. First, positive-pair density per batch controls how often saturation occurs. Second, task difficulty compresses seed-to-seed spread when accuracy is high. An SVHN difficulty sweep confirms the interaction on a single dataset, with the variance ratio moving from $0.25\times$ at high accuracy to $16.73\times$ under aggressive augmentation. In moderate-accuracy regimes with many same-class pairs per batch, switching to the gradient-neutral subtraction reference removes this variance inflation at no cost to mean accuracy. Measuring the layer-0 clamp activation rate serves as a simple check for whether the problem applies.
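The clamp-rate check mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not define the clamp activation rate, so the version below is an assumption: it takes the rate to be the fraction of same-class pairs (excluding self-pairs) whose margin-shifted similarity $s + m$ reaches the saturation point of $\min(s + m,\, 1)$. The function names and the nested-list input format are hypothetical, not taken from the paper.

```python
def clamped_similarity(s, m):
    """Saturating positive-pair margin from the abstract: min(s + m, 1)."""
    return min(s + m, 1.0)


def clamp_activation_rate(sims, labels, m):
    """Fraction of positive pairs whose shifted similarity saturates.

    Assumed definition (not confirmed by the source): a pair (u, v) with
    u != v and labels[u] == labels[v] counts as clamped when
    sims[u][v] + m >= 1, i.e. when min(s + m, 1) returns 1.
    """
    n = len(labels)
    positives = 0
    clamped = 0
    for u in range(n):
        for v in range(n):
            if u != v and labels[u] == labels[v]:
                positives += 1
                if sims[u][v] + m >= 1.0:
                    clamped += 1
    # Return 0.0 when the batch contains no positive pairs at all.
    return clamped / positives if positives else 0.0


# Toy batch: two classes, two samples each (illustrative values only).
toy_labels = [0, 0, 1, 1]
toy_sims = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.5],
    [0.0, 0.1, 0.5, 1.0],
]
toy_rate = clamp_activation_rate(toy_sims, toy_labels, m=0.2)
```

In this toy batch only the class-0 pair saturates, so half of the ordered positive pairs are clamped. A batch drawn from a dataset with few classes contains more same-class pairs, which is the positive-pair density factor the abstract identifies.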

Paper Structure

This paper contains 95 sections, 1 theorem, 14 equations, 2 figures, and 23 tables.

Key Result

Proposition 4.1

Let $m_\ell \geq 0$ be a fixed scalar (not a function of model parameters). Under the mean-over-positives loss eq:supcon_loss, replacing $\log p_{uv,\ell}$ with $\log \tilde{p}_{uv,\ell}$ from eq:subtract_margin shifts each per-anchor term by the constant $m_\ell$ and leaves all gradients with respe

Figures (2)

  • Figure 1: Per-seed test accuracy by margin type (standard margin, pooled). Each dot is one seed ($n{=}14$ per group). Horizontal lines show group means; shaded bands show $\pm 1$ standard deviation.
  • Figure 2: Diagnostic profiles by layer (CIFAR-10). (a) CAR under standard and low-margin schedules. (b) Gradient $\ell_2$ norms at the final epoch. The gradient norm gap between conditions tracks the CAR profile, closing by layer 2.

Theorems & Definitions (4)

  • Definition 4.1: Concatenated view indexing
  • Definition 4.2: Stop-gradient
  • Proposition 4.1: Post-log-probability subtraction is gradient-neutral
  • proof
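
The constant-shift argument behind Proposition 4.1 can be sketched in one line. This is a sketch under stated assumptions, since eq:supcon_loss and eq:subtract_margin are not reproduced here: it assumes the per-anchor loss averages $-\log p_{uv,\ell}$ over the positive set of anchor $u$, written $P(u)$ (a hypothetical symbol), and that the subtraction form is $\log \tilde{p}_{uv,\ell} = \log p_{uv,\ell} - m_\ell$, with $\theta$ denoting the model parameters.

$$
\tilde{\mathcal{L}}_{u,\ell}
= -\frac{1}{|P(u)|} \sum_{v \in P(u)} \bigl(\log p_{uv,\ell} - m_\ell\bigr)
= \mathcal{L}_{u,\ell} + m_\ell,
\qquad
\nabla_\theta \tilde{\mathcal{L}}_{u,\ell} = \nabla_\theta \mathcal{L}_{u,\ell}
\quad \text{since } \nabla_\theta m_\ell = 0 .
$$

Under these assumptions the mean over positives is what makes the shift exactly $m_\ell$ for every anchor, independent of $|P(u)|$. By contrast, $\min(s + m,\, 1)$ has zero derivative with respect to $s$ wherever $s + m > 1$, so a saturated pair contributes no gradient through its similarity.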