Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks

Shakir Yousefi, Andreas Plesner, Till Aczel, Roger Wattenhofer

TL;DR

This work tackles the discretization gap and slow training in differentiable logic gate networks by introducing Gumbel Logic Gate Networks, which inject Gumbel noise into gate selection and use a straight-through estimator to align training with inference. The authors prove that Gumbel perturbations implicitly regularize the Hessian trace, yielding smoother loss landscapes and faster convergence, while reducing sensitivity to discretization. Empirically, GLGNs outperform prior DLGNs on CIFAR-10/100, achieving up to 4.5× faster training, a 98% reduction in the discretization gap, and near-zero unused gates, with benefits that scale with depth. The results suggest that stochastic gate selection combined with backward-compatible discretization can substantially improve the practicality of differentiable LGNs for efficient image classification and broader NAS-like search spaces.

Abstract

Modern neural networks demonstrate state-of-the-art performance on numerous existing benchmarks; however, their high computational requirements and energy consumption prompt researchers to seek more efficient solutions for real-world deployment. Logic gate networks (LGNs) learn a large network of logic gates for efficient image classification. However, training a network that can solve a simple problem like CIFAR-10 can take days to weeks. Even then, almost half of the network remains unused, causing a discretization gap. This discretization gap hinders real-world deployment of LGNs, as the performance drop between training and inference negatively impacts accuracy. We inject Gumbel noise with a straight-through estimator during training to significantly speed up training, improve neuron utilization, and decrease the discretization gap. We theoretically show that this results from implicit Hessian regularization, which improves the convergence properties of LGNs. We train networks $4.5 \times$ faster in wall-clock time, reduce the discretization gap by $98\%$, and reduce the number of unused gates by $100\%$.
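
The gate-selection step described in the abstract can be illustrated with a minimal pure-Python sketch. It assumes each node holds 16 logits (one per two-input Boolean function), that Gumbel(0, 1) noise is added to the logits before a temperature-scaled softmax, and that the discrete choice is the argmax of the noisy logits. The names `gumbel_select`, `sample_gumbel`, and `NUM_GATES` are hypothetical and are not taken from the paper's code; only the forward quantities are computed here, since a straight-through estimator additionally needs an autodiff framework.

```python
import math
import random

NUM_GATES = 16  # assumption: one logit per two-input Boolean function


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform: -log(-log(U))."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logarithms finite
    return -math.log(-math.log(u))


def gumbel_select(logits, tau, rng):
    """Return (hard, soft) gate weights for one node.

    hard: one-hot vector for the argmax of the noisy logits.
    soft: softmax of the noisy logits at temperature tau.
    In an autodiff framework, a straight-through estimator would use
    `hard` in the forward pass and the gradient of `soft` in the
    backward pass; this sketch only computes the two forward values.
    """
    noisy = [z + sample_gumbel(rng) for z in logits]
    soft = softmax(noisy, tau)
    idx = max(range(len(noisy)), key=noisy.__getitem__)
    hard = [1.0 if i == idx else 0.0 for i in range(len(noisy))]
    return hard, soft
```

Calling `gumbel_select(logits, 1.0, random.Random(0))` repeatedly selects gate `i` with frequency close to `softmax(logits)[i]` (the Gumbel-max trick), so gates with lower logits are still selected occasionally during training.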

Paper Structure

This paper contains 50 sections, 4 theorems, 30 equations, 13 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

Let $\mathcal{L}: \mathbb{R}^{16} \to \mathbb{R}$ be twice continuously differentiable (with Lipschitz Hessian), and let $\mathbf{z} \in \mathbb{R}^{16}$, $\mathbf{g} \sim \mathrm{Gumbel}(0,1)^{16}$. Set $\mathbf{a} = \mathbf{z} / \tau$ and $f(\mathbf{a}) = \mathcal{L}(\mathrm{softmax}(\mathbf{a}))$. We then get the expression below.
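
A second-order Taylor expansion under these assumptions indicates the kind of expression involved. The following is a sketch rather than a quotation of the lemma: it assumes the perturbed logits are $(\mathbf{z} + \mathbf{g})/\tau = \mathbf{a} + \mathbf{g}/\tau$, treats $\mathbf{a}$ as fixed, and uses the standard facts that each coordinate of $\mathbf{g}$ is independent with mean $\gamma$ (the Euler-Mascheroni constant) and variance $\pi^2/6$.

```latex
\begin{aligned}
\mathbb{E}_{\mathbf{g}}\!\left[ f\!\left(\mathbf{a} + \tfrac{\mathbf{g}}{\tau}\right) \right]
  &= f(\mathbf{a})
   + \frac{1}{\tau}\,\nabla f(\mathbf{a})^{\top}\,\mathbb{E}[\mathbf{g}]
   + \frac{1}{2\tau^{2}}\,\mathbb{E}\!\left[\mathbf{g}^{\top}\,\nabla^{2} f(\mathbf{a})\,\mathbf{g}\right]
   + O(\tau^{-3}) \\
  &= f(\mathbf{a})
   + \frac{\pi^{2}}{12\,\tau^{2}}\,\operatorname{tr}\!\left(\nabla^{2} f(\mathbf{a})\right)
   + O(\tau^{-3}).
\end{aligned}
```

The second line uses $\mathbb{E}[\mathbf{g}\mathbf{g}^{\top}] = \tfrac{\pi^{2}}{6} I + \gamma^{2}\,\mathbf{1}\mathbf{1}^{\top}$ together with the translation invariance of softmax: since $f(\mathbf{a} + c\,\mathbf{1}) = f(\mathbf{a})$ for every scalar $c$, both $\nabla f(\mathbf{a})^{\top}\mathbf{1} = 0$ and $\nabla^{2} f(\mathbf{a})\,\mathbf{1} = \mathbf{0}$, so every term involving $\gamma$ vanishes. The Lipschitz Hessian and the finite third absolute moment of the Gumbel distribution control the remainder. Read this way, averaging the loss over Gumbel noise adds a penalty proportional to the Hessian trace, which is consistent with the implicit Hessian regularization described in the abstract.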

Figures (13)

  • Figure 1: CIFAR-10 test accuracy comparison. Solid and dashed lines show discrete and soft performance, respectively. Gumbel LGNs (GLGN, red) demonstrate faster convergence and minimal discretization gap compared to DLGNs (DLGN, blue).
  • Figure 2: Overview figure. (a) DLGNs: the leftmost panel shows the internal structure of a node. Each node is parameterized by weighting the 16 possible logic gates (shown below) and summing their outputs. This yields a hard, brittle loss landscape, which slows convergence and increases the discretization gap. (b) Gumbel LGNs: during training, we inject Gumbel noise into the 16 gate weights and select the logic gate with the highest weight. This smooths the loss landscape and aligns training with the network used at inference, which speeds up training and reduces the discretization gap. (c) Structure of DLGNs and Gumbel LGNs. Each neuron receives two inputs. The nodes in the final layer are aggregated by summation, producing class likelihoods. (d) Results of using Gumbel LGNs instead of DLGNs. We achieve up to $4.5\times$ faster convergence (in wall-clock time), a 98% reduction in the discretization gap, and a 100% reduction in unused neurons.
  • Figure 3: Performance of Gumbel LGNs and DLGNs on CIFAR-10. (Left) Test accuracy (width of $256$k, depth of 12). (Right) Discretization gap for various depths. DLGNs experience larger gaps and slower reduction as the depth increases. In contrast, Gumbel LGNs have consistently low gaps and fast reduction as the network depth increases.
  • Figure 4: Performance of Gumbel LGNs and DLGNs on CIFAR-100. (Left) Test accuracy (width of $256$k, depth of 12). (Right) Discretization gap for various depths. DLGNs experience larger gaps and slower reduction as the depth increases. In contrast, Gumbel LGNs have consistently low gaps and fast reduction as the network depth increases.
  • Figure 5: Test accuracy on CIFAR-10 for a shallow, wide network (width 2048k, depth 6).
  • ...and 8 more figures
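
The node structure in Figure 2(a) and the source of the discretization gap can be sketched for a single node. The captions do not spell out how each gate is evaluated on real-valued inputs, so the sketch assumes the common probabilistic relaxation (for example, AND as `a * b` and OR as `a + b - a * b`) and a particular ordering of the 16 gates; `GATES`, `soft_node`, and `hard_node` are hypothetical names.

```python
import math

# Assumed real-valued relaxations of the 16 two-input Boolean functions.
# On Boolean inputs, gate i has the truth table given by the bits of i,
# read in the input order (0,0), (0,1), (1,0), (1,1).
GATES = [
    lambda a, b: 0.0,                        # 0: FALSE
    lambda a, b: a * b,                      # 1: a AND b
    lambda a, b: a - a * b,                  # 2: a AND NOT b
    lambda a, b: a,                          # 3: a
    lambda a, b: b - a * b,                  # 4: NOT a AND b
    lambda a, b: b,                          # 5: b
    lambda a, b: a + b - 2 * a * b,          # 6: a XOR b
    lambda a, b: a + b - a * b,              # 7: a OR b
    lambda a, b: 1 - (a + b - a * b),        # 8: NOR
    lambda a, b: 1 - (a + b - 2 * a * b),    # 9: XNOR
    lambda a, b: 1 - b,                      # 10: NOT b
    lambda a, b: 1 - b + a * b,              # 11: b IMPLIES a
    lambda a, b: 1 - a,                      # 12: NOT a
    lambda a, b: 1 - a + a * b,              # 13: a IMPLIES b
    lambda a, b: 1 - a * b,                  # 14: NAND
    lambda a, b: 1.0,                        # 15: TRUE
]


def softmax(logits):
    """Softmax shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_node(a, b, logits):
    """Training-time output: softmax-weighted sum over all 16 gate outputs."""
    weights = softmax(logits)
    return sum(w * gate(a, b) for w, gate in zip(weights, GATES))


def hard_node(a, b, logits):
    """Inference-time output: only the gate with the largest logit is kept."""
    idx = max(range(len(GATES)), key=logits.__getitem__)
    return GATES[idx](a, b)
```

When two logits are nearly tied, say AND slightly ahead of OR, `soft_node(1.0, 0.0, logits)` sits near the middle of the unit interval while `hard_node(1.0, 0.0, logits)` returns the AND output of 0. A difference of this kind at a single node illustrates how the soft and discrete networks can diverge; selecting one gate during training removes the mixture from the forward pass.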

Theorems & Definitions (8)

  • Lemma 1: Gumbel‐Smoothing
  • proof
  • Lemma 2: Translation-Invariance of Softmax
  • proof
  • Lemma 3: Gumbel‐Smoothing
  • proof
  • Lemma 4
  • proof