Align Forward, Adapt Backward: Closing the Discretization Gap in Logic Gate Networks

Youngsung Kim

Abstract

In neural network models, soft mixtures of fixed candidate components (e.g., logic gates and sub-networks) are often used during training for stable optimization, while hard selection is typically used at inference. This raises questions about training-inference mismatch. We analyze this gap by separating forward-pass computation (hard selection vs. soft mixture) from stochasticity (with vs. without Gumbel noise). Using logic gate networks as a testbed, we observe distinct behaviors across four methods: Hard-ST achieves zero selection gap by construction; Gumbel-ST achieves near-zero gap when training succeeds but suffers accuracy collapse at low temperatures; Soft-Mix achieves small gap only at low temperature via weight concentration; and Soft-Gumbel exhibits large gaps despite Gumbel noise, confirming that noise alone does not reduce the gap. We propose CAGE (Confidence-Adaptive Gradient Estimation) to maintain gradient flow while preserving forward alignment. On logic gate networks, Hard-ST with CAGE achieves over 98% accuracy on MNIST and over 58% on CIFAR-10, both with zero selection gap across all temperatures, while Gumbel-ST without CAGE suffers a 47-point accuracy collapse.
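The 2×2 design described in the abstract (hard vs. soft forward pass, with vs. without Gumbel noise) can be illustrated for a single gate node. The following is a minimal sketch, not the paper's implementation: it assumes that Hard-ST, Gumbel-ST, Soft-Mix, and Soft-Gumbel map onto the four cells as labelled in `METHODS`, it uses four of the sixteen two-input Boolean functions with standard real-valued relaxations, and it shows only the forward pass (the straight-through backward pass is omitted). All function names are hypothetical.

```python
import math
import random

# Four two-input gates as real-valued relaxations (illustrative subset).
GATES = {
    "AND": lambda a, b: a * b,
    "OR": lambda a, b: a + b - a * b,
    "XOR": lambda a, b: a + b - 2 * a * b,
    "NAND": lambda a, b: 1 - a * b,
}

# Assumed mapping of method names to (hard forward?, Gumbel noise?).
METHODS = {
    "Hard-ST": (True, False),
    "Gumbel-ST": (True, True),
    "Soft-Mix": (False, False),
    "Soft-Gumbel": (False, True),
}

def softmax(scores, tau):
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel(rng):
    # Gumbel(0, 1) sample; the max() guards against log(0).
    u = max(rng.random(), 1e-300)
    return -math.log(-math.log(u))

def forward(logits, a, b, tau, hard, noisy, rng):
    """Training-time output of one gate node."""
    outs = [g(a, b) for g in GATES.values()]
    scores = [z + gumbel(rng) for z in logits] if noisy else list(logits)
    if hard:
        # Hard selection: commit to a single gate.
        k = max(range(len(scores)), key=scores.__getitem__)
        return outs[k]
    # Soft mixture: weighted sum over all candidate gates.
    weights = softmax(scores, tau)
    return sum(w * o for w, o in zip(weights, outs))

def inference(logits, a, b):
    """Inference-time output: argmax over the learned logits, no noise."""
    outs = [g(a, b) for g in GATES.values()]
    k = max(range(len(logits)), key=logits.__getitem__)
    return outs[k]

if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 0.5, 0.1, -1.0]
    for name, (hard, noisy) in METHODS.items():
        train_out = forward(logits, 1.0, 0.0, 1.0, hard, noisy, rng)
        print(name, abs(train_out - inference(logits, 1.0, 0.0)))
```

In this sketch the noiseless hard forward pass is the same computation as inference, which is why its mismatch is zero by construction, while the noiseless mixture approaches the inference output only as `tau` shrinks and the weights concentrate on one gate.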

Paper Structure (73 sections, 10 theorems, 39 equations, 5 figures, 14 tables, 1 algorithm)

Key Result

Proposition 1

For $G_i \stackrel{\mathrm{iid}}{\sim} \mathrm{Gumbel}(0,1)$: This probability is temperature-independent since $\arg\max$ is scale-invariant. (Proof sketch in Appendix proof:gumbel_max)
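The scale-invariance argument can be checked numerically. In its standard form, the Gumbel-Max result says that adding i.i.d. Gumbel(0,1) noise to a vector of logits and taking the $\arg\max$ selects index $k$ with probability equal to the softmax of the logits at $k$; dividing the perturbed logits by any $\tau > 0$ leaves the $\arg\max$ unchanged. The sketch below is a Monte Carlo illustration under that standard statement. The symbol names (`logits`, `tau`) and all function names are assumptions, since the proposition's own equation is not reproduced in this summary.

```python
import math
import random

def sample_gumbel(rng):
    # Gumbel(0, 1) sample via inverse transform; max() guards against log(0).
    u = max(rng.random(), 1e-300)
    return -math.log(-math.log(u))

def gumbel_argmax(logits, tau, rng):
    # Perturb each logit with Gumbel noise, scale by 1/tau, take the argmax.
    perturbed = [(z + sample_gumbel(rng)) / tau for z in logits]
    return max(range(len(perturbed)), key=perturbed.__getitem__)

def empirical_selection_freq(logits, tau, n_samples, seed=0):
    # Fraction of draws in which each index wins the argmax.
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(n_samples):
        counts[gumbel_argmax(logits, tau, rng)] += 1
    return [c / n_samples for c in counts]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    logits = [1.0, 0.0, -1.0]
    print("softmax      ", softmax(logits))
    for tau in (0.1, 1.0, 5.0):
        print("tau =", tau, empirical_selection_freq(logits, tau, 20000))
```

Because the same seed produces the same noise draws for every `tau`, the selection frequencies coincide across temperatures, and each should sit close to the softmax of the unscaled logits.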

Figures (5)

  • Figure 1: Training-inference gap depends primarily on forward-pass structure. Left: Mixture methods (top) compute weighted sums, enabling hedging; hard selection methods (bottom) commit to a single gate, matching inference. Right-Top: Inference uses argmax selection. Right-Bottom: Peak selection gap ($2 \times 2$ factorial, MNIST-Binary, $L$=6, $\tau$=2.0). Hard methods achieve near-zero gap; soft methods show 24--40% gap, with Gumbel noise amplifying the gap.
  • Figure 2: Main results (MNIST-Binary, $L$=6, $\tau$=1.0 unless varied). (a) Selection gap during training: Hard-ST achieves exactly zero; soft methods show non-zero gaps. (b) Test accuracy vs temperature: Hard-ST and CAGE remain stable; Gumbel-ST fails at low $\tau$. (c) Iterations to convergence: Hard-ST converges significantly faster than soft methods. (d) Gate commitment by layer: Hard-ST commits; Soft-Gumbel hedges, especially in early layers.
  • Figure 3: CAGE adaptation dynamics (MNIST-Binary, $L{=}6$). (a) Backward temperature over training: $\tau_b$ decreases from $\tau_{\max}$ as confidence rises. (b) EMA confidence over training: increases from uniform ($\sim$0.25) toward commitment ($\sim$0.9). Results averaged over $n{=}3$ random seeds.
  • Figure 4: Gap heatmaps on MNIST. Hard-ST shows uniform near-zero gap (light colors) across all conditions. Gumbel-ST exhibits catastrophic negative gap at low $\tau$ (dark blue = training failure where deployment outperforms training). CAGE eliminates this failure, restoring uniform small gap.
  • Figure 5: Gate usage distribution (MNIST-Binary, $L$=6, $\tau$=1.0). Hard-ST methods show pronounced XOR/XNOR preference ($\sim$11% each vs. $\sim$6% expected under uniform selection), while soft methods distribute usage more uniformly. This suggests hard selection encourages commitment to more expressive Boolean operations.
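
The qualitative behaviour in Figure 3 (a backward temperature $\tau_b$ that starts at $\tau_{\max}$ and falls as an EMA of confidence rises from about 0.25 toward 0.9) can be sketched as follows. This is an illustration only: the paper's actual CAGE update rule is not given in this summary, so the linear map from confidence to temperature, the EMA coefficient `beta`, the floor `tau_min`, and every function name here are hypothetical choices made to reproduce the shape of the curves, not the author's method.

```python
def ema_update(prev, value, beta=0.9):
    # Exponential moving average; beta is a hypothetical smoothing factor.
    return beta * prev + (1.0 - beta) * value

def backward_temperature(confidence, tau_min=0.1, tau_max=2.0, c_uniform=0.25):
    # Hypothetical linear schedule: tau_max at the uniform-confidence level,
    # tau_min at full confidence. Confidence is rescaled and clamped to [0, 1].
    c = (confidence - c_uniform) / (1.0 - c_uniform)
    c = min(max(c, 0.0), 1.0)
    return tau_max - (tau_max - tau_min) * c

def simulate(confidences, beta=0.9):
    # Track EMA confidence and the resulting backward temperature per step.
    ema = confidences[0]
    trace = []
    for conf in confidences:
        ema = ema_update(ema, conf, beta)
        trace.append((ema, backward_temperature(ema)))
    return trace

if __name__ == "__main__":
    for ema, tau_b in simulate([0.25, 0.4, 0.6, 0.8, 0.9, 0.9]):
        print(round(ema, 3), round(tau_b, 3))
```

Only the backward temperature is adapted in this sketch; the forward pass would stay a hard selection, which is consistent with the summary's description of CAGE as preserving forward alignment while maintaining gradient flow.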

Theorems & Definitions (18)

  • Proposition 1: Gumbel-Max Theorem (Gumbel, 1954)
  • Corollary 2: Gumbel-ST: Selection Gap
  • Proposition 3: Hard-ST: Zero Selection Gap
  • Proposition 4: Mixture Methods: Selection Gap
  • Theorem 5: Forward Alignment Principle
  • Theorem 6: Unified Selection Gap Bound
  • Proposition 7: Loss Gap Bound
  • Proposition 8: Gradient Structure
  • Proposition 9: Competitive Gate Selection
  • Proposition 10: Convergence Scaling
  • ...and 8 more