Marginals Before Conditionals

Mihir Sahasrabudhe

TL;DR

We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z. The model first learns the marginal P(A | B), producing a loss plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition.

Abstract

We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z, so H(A | B) = log K while H(A | B, z) = 0. The model learns the marginal P(A | B) first, producing a plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition. The plateau has a clean decomposition: height = log K (set by ambiguity), duration = f(D) (set by dataset size D, not K). Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6× across a 7× η range at fixed throughput), and batch-size reduction delays escape, consistent with an entropic force opposing departure from the low-gradient marginal. Internally, a selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time. This is the Type 2 directional asymmetry of Papadopoulos et al. [2024], measured dynamically: we track the excess risk from log K to zero and characterize what stabilizes it, what triggers its collapse, and how long it takes.
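
The two entropy claims in the abstract can be checked on a toy instance. The sketch below is illustrative only and is not the paper's data pipeline: the integer encoding `a = b * K + z`, the function names `make_task` and `conditional_entropy`, and the small sizes `n_b=50`, `K=5` are all assumptions made for the example. It builds a surjective map in which each B token has K preimages, then computes the empirical conditional entropies with and without the selector z.

```python
import math
from collections import Counter


def make_task(n_b, K):
    """Build (b, z, a) triples where each b has K distinct targets a.

    Hypothetical encoding: a = b * K + z, so the map a -> a // K = b is
    surjective with exactly K preimages per b, and (b, z) pins down a.
    """
    return [(b, z, b * K + z) for b in range(n_b) for z in range(K)]


def conditional_entropy(pairs):
    """Empirical H(Y | X) in nats from a list of (x, y) pairs."""
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (x, _y), count in joint.items():
        # p(x, y) * log p(y | x), summed with a minus sign
        h -= (count / n) * math.log(count / marginal_x[x])
    return h


K = 5
triples = make_task(n_b=50, K=K)

# Without the selector, each b is compatible with K equally likely targets.
h_marginal = conditional_entropy([(b, a) for b, z, a in triples])

# With the selector, the target is fully determined.
h_full = conditional_entropy([((b, z), a) for b, z, a in triples])

print(f"H(A | B)    = {h_marginal:.4f} nats (log K = {math.log(K):.4f})")
print(f"H(A | B, z) = {h_full:.4f} nats")
```

Under this construction the dataset size is D = n_b · K, and a predictor that ignores z can do no better than a cross-entropy of log K, which is the plateau height described above.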

Paper Structure (38 sections, 5 figures, 12 tables)


Figures (5)

  • Figure 1: Staged disambiguation. Loss curves for $K \in \{5, 10, 20, 36\}$ at $n_b{=}1{,}000$. Dashed lines mark $\log K$. The model converges to the marginal solution within a few hundred steps, then plateaus until a sharp transition to near-zero loss.
  • Figure 2: Noise delays escape. (a) Batch-size effect: a modest $1.8\times$ delay in tokens processed (the $23\times$ step-count ratio is predominantly throughput), consistent with entropic stabilization \cite{ziyin2024sgd}. (b) Label noise shows a $16\times$ delay in tokens, but also degrades the conditional signal (§\ref{sec:entropic}); the LR sweep ($3.6\times$ at constant throughput) provides the cleanest evidence.
  • Figure 3: Internal precursors. (a) $z$-dependence leads the loss transition across all $K$. (b) The transition marks the onset of coherent descent.
  • Figure 4: $\tau$ versus $D$ on log-log axes. Blue circles: 10-point $K$-sweep at $n_b = 1{,}000$, reinterpreted as a $D$-sweep ($D = 1{,}000K$). Red squares: fixed-$D$ control ($D = 10{,}000$ and $20{,}000$, four $K$ values each), confirming $K$-independence. Dashed line: power-law fit $\tau \propto D^{1.19}$.
  • Figure 5: Duration depends on $D$, not $K$. At fixed $D$, $\tau$ is flat across $K$. Doubling $D$ roughly doubles $\tau$.
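
The power-law reading of Figures 4 and 5 can be made concrete with a small sketch. Only the exponent 1.19 and the relation $D = 1{,}000K$ come from the captions; the prefactor `C`, the helper `loglog_slope`, and the $\tau$ values generated here are synthetic assumptions used to illustrate how a slope is read off log-log axes, not measured results.

```python
import math

EXPONENT = 1.19  # reported fit in the Figure 4 caption: tau ∝ D^1.19
C = 1e-3         # hypothetical prefactor, chosen only for illustration


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Dataset sizes implied by D = 1,000 * K for K in {5, 10, 20, 36}.
D_values = [1_000 * K for K in (5, 10, 20, 36)]

# Synthetic plateau durations lying exactly on the assumed power law.
tau_values = [C * D ** EXPONENT for D in D_values]

slope = loglog_slope(D_values, tau_values)

# Factor by which tau grows when D doubles under this exponent.
doubling_ratio = 2.0 ** EXPONENT

print(f"recovered slope: {slope:.3f}")
print(f"tau ratio when D doubles: {doubling_ratio:.2f}")
```

With an exponent of 1.19, doubling $D$ multiplies $\tau$ by $2^{1.19} \approx 2.28$, which is what "roughly doubles" in the Figure 5 caption corresponds to under this fit.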