When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning

Joshua Steier

TL;DR

The paper proves that subtracting the margin after the log-probability, as an alternative to saturating similarity clamping, is gradient-neutral under the mean-over-positives reduction. In moderate-accuracy regimes with many same-class pairs per batch, switching to this formulation removes the variance inflation that clamping causes.

Abstract

Contrastive Forward-Forward (CFF) learning trains Vision Transformers layer by layer against supervised contrastive objectives. CFF training can be sensitive to random seed, but the sources of this instability are poorly understood. We focus on one implementation detail: the positive-pair margin in the contrastive loss is applied through saturating similarity clamping, $\min(s + m,\, 1)$. We prove that an alternative formulation, subtracting the margin after the log-probability, is gradient-neutral under the mean-over-positives reduction. On CIFAR-10 ($2 \times 2$ factorial, $n{=}7$ seeds per cell), clamping produces $5.90\times$ higher pooled test-accuracy variance ($p{=}0.003$) with no difference in mean accuracy. Analyses of clamp activation rates, layerwise gradient norms, and a reduced-margin probe point to saturation-driven gradient truncation at early layers. The effect does not transfer cleanly to other datasets: on CIFAR-100, SVHN, and Fashion-MNIST, clamping produces equal or lower variance. Two factors account for the discrepancy. First, positive-pair density per batch controls how often saturation occurs. Second, task difficulty compresses seed-to-seed spread when accuracy is high. An SVHN difficulty sweep confirms the interaction on a single dataset, with the variance ratio moving from $0.25\times$ at high accuracy to $16.73\times$ under aggressive augmentation. In moderate-accuracy regimes with many same-class pairs per batch, switching to the gradient-neutral subtraction reference removes this variance inflation at no cost to mean accuracy. Measuring the layer-0 clamp activation rate serves as a simple check for whether the problem applies.
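The saturating clamp and the clamp activation rate check mentioned in the abstract can be illustrated with a minimal sketch. It assumes the clamp activation rate is the fraction of positive pairs whose margin-shifted similarity reaches the ceiling, $s + m \geq 1$; the paper's exact definition is not given in this excerpt. The function names and example similarities are hypothetical.

```python
def clamp_positive(s, m):
    """Saturating margin from the abstract: min(s + m, 1)."""
    return min(s + m, 1.0)


def clamp_grad(s, m):
    """Derivative of min(s + m, 1) with respect to s.

    It is 1 below saturation and 0 once the clamp is active, which is
    the gradient truncation the abstract refers to.
    """
    return 1.0 if s + m < 1.0 else 0.0


def clamp_activation_rate(pos_sims, m):
    """Assumed definition: share of positive pairs with s + m >= 1."""
    active = sum(1 for s in pos_sims if s + m >= 1.0)
    return active / len(pos_sims)


# Hypothetical positive-pair cosine similarities at one layer.
pos_sims = [0.95, 0.5, 0.7, 0.85]
margin = 0.2

car = clamp_activation_rate(pos_sims, margin)
grads = [clamp_grad(s, margin) for s in pos_sims]
print(car, grads)
```

In this toy batch half of the positive pairs saturate, so half of the positive-pair similarities pass no gradient through the margin term. A rate near zero would suggest that clamping is rarely active and the variance effect is unlikely to apply.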

Paper Structure (95 sections, 1 theorem, 14 equations, 2 figures, 23 tables)

Introduction
Focus.
Main results.
Contributions.
Background and Related Work
Forward-Forward and layer-local learning
Contrastive and supervised contrastive learning
Margins in metric and contrastive learning
Training variance and reproducibility
Gradient truncation in optimization
Problem Setup
Primary endpoint.
Experimental factors.
Diagnostic endpoints.
Methods
...and 80 more sections

Key Result

Proposition 4.1

Let $m_\ell \geq 0$ be a fixed scalar (not a function of model parameters). Under the mean-over-positives loss eq:supcon_loss, replacing $\log p_{uv,\ell}$ with $\log \tilde{p}_{uv,\ell}$ from eq:subtract_margin shifts each per-anchor term by the constant $m_\ell$ and leaves all gradients with respe
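The proposition can be checked numerically with a small sketch. It assumes the subtraction form is $\log \tilde{p}_{uv} = \log p_{uv} - m$ with a softmax over temperature-scaled similarities; the referenced equations are not reproduced in this excerpt, so the loss below is an illustrative stand-in and not the paper's code. Gradients are taken with respect to the similarities by central finite differences; neutrality with respect to model parameters then follows from the chain rule. All names and values are hypothetical.

```python
import math


def anchor_loss(sims, pos_idx, tau=0.1, margin=0.0):
    """Mean-over-positives loss for one anchor.

    The margin is subtracted after the log-probability (assumed form).
    """
    logits = [s / tau for s in sims]
    mx = max(logits)
    # Numerically stable log-sum-exp over all candidates.
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    terms = [(logits[v] - log_z) - margin for v in pos_idx]
    return -sum(terms) / len(terms)


def finite_diff_grad(fn, x, eps=1e-6):
    """Central-difference gradient of fn at x."""
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((fn(hi) - fn(lo)) / (2 * eps))
    return grads


sims = [0.9, 0.2, -0.3, 0.6, 0.1]  # anchor-to-candidate similarities
pos_idx = [0, 3]                   # indices of same-class candidates
m = 0.4                            # fixed scalar margin

g_plain = finite_diff_grad(lambda s: anchor_loss(s, pos_idx), sims)
g_sub = finite_diff_grad(lambda s: anchor_loss(s, pos_idx, margin=m), sims)

shift = anchor_loss(sims, pos_idx, margin=m) - anchor_loss(sims, pos_idx)
max_grad_gap = max(abs(a - b) for a, b in zip(g_plain, g_sub))
print(shift, max_grad_gap)
```

The loss value moves by the constant $m$ while the two gradients agree up to floating-point error, which is the behavior the proposition describes for a margin that does not depend on model parameters.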

Figures (2)

Figure 1: Per-seed test accuracy by margin type (standard margin, pooled). Each dot is one seed ($n{=}14$ per group). Horizontal lines show group means; shaded bands show $\pm 1$ standard deviation.
Figure 2: Diagnostic profiles by layer (CIFAR-10). (a) Clamp activation rate (CAR) under standard and low-margin schedules. (b) Gradient $\ell_2$ norms at the final epoch. The gradient norm gap between conditions tracks the CAR profile, closing by layer 2.

Theorems & Definitions (4)

Definition 4.1: Concatenated view indexing
Definition 4.2: Stop-gradient
Proposition 4.1: Post-log-probability subtraction is gradient-neutral
proof
