Deep Learning as a Convex Paradigm of Computation: Minimizing Circuit Size with ResNets
Arthur Jacot
TL;DR
The paper investigates why deep neural networks (DNNs) generalize so well by connecting the computation of real-valued functions to circuit-size minimization. It introduces the HTMC norm $\|f\|_{H^{\gamma}}$ (for $\gamma>2$) and a ResNet-based complexity $\|f\|_{R^{\omega}}$, then proves a sandwich bound that links the two, suggesting that DNN optimization effectively performs a near-minimal circuit-size search in a convex function-space regime. Key contributions are the HTMC convexity result and the construction of Tetrakis functions, which approximate the vertices of the HTMC ball and enable a constructive proof of the right-hand side of the sandwich bound via ResNets. The work also provides PAC generalization guarantees in terms of the HTMC norm and formalizes a practical pathway to convex optimization for circuit-size minimization through ResNet architectures. Overall, the results offer a principled framework for viewing DNN training as implicitly solving minimal-circuit problems, with potential implications for convergence proofs and compositional learning.
Abstract
This paper argues that DNNs implement a computational Occam's razor -- finding the `simplest' algorithm that fits the data -- and that this could explain their incredible and wide-ranging success over more traditional statistical methods. We start with the discovery that the set of real-valued functions $f$ that can be $\epsilon$-approximated with a binary circuit of size at most $c\epsilon^{-\gamma}$ becomes convex in the `Harder than Monte Carlo' (HTMC) regime, when $\gamma>2$, allowing for the definition of an HTMC norm on functions. In parallel, one can define a complexity measure on the parameters of a ResNet (a weighted $\ell_1$ norm of the parameters), which induces a `ResNet norm' on functions. The HTMC and ResNet norms can then be related by an almost matching sandwich bound. Thus minimizing this ResNet norm is equivalent to finding a circuit that fits the data with an almost minimal number of nodes (within a power of 2 of being optimal). ResNets thus appear as an alternative model for the computation of real functions, better adapted to the HTMC regime and its convexity.
