Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks
Yuhang Cai, Kangjie Zhou, Jingfeng Wu, Song Mei, Michael Lindsey, Peter L. Bartlett
TL;DR
This work extends the theory of the implicit bias of gradient descent from homogeneous networks to a broad class of non-homogeneous deep networks by introducing a near-homogeneity framework and a strong separability condition. It proves that gradient flow induces a nearly monotone normalized margin, that the iterates diverge in norm while converging in direction, and that the limiting direction satisfies the KKT conditions of a margin-maximization problem for the homogenized network. The results cover architectures with residual connections and non-homogeneous activations, and they extend to gradient descent with large steps via corresponding margin- and homogenization-based arguments, illustrated with a two-layer network example. Overall, the paper resolves an open problem posed by Ji and Telgarsky (2020) by showing that implicit bias results previously known only for homogeneous networks carry over to non-homogeneous ones under a mild near-homogeneity assumption.
Abstract
We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by Ji and Telgarsky (2020).
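The three properties above can be observed numerically on a toy problem. The sketch below is not the paper's construction. It assumes a hypothetical model f(w; x) = w·x + 0.5·tanh(w·x), a bounded and therefore non-homogeneous perturbation of a linear classifier, trained by GD on the exponential loss over a hand-picked separable dataset. The data, step size, and initialization are illustrative assumptions, and whether this model meets the paper's exact near-homogeneity condition is not checked here.

```python
import numpy as np

# Toy separable data (hypothetical, chosen only for illustration).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])


def f(w, X):
    """Assumed toy model: linear part plus a bounded non-homogeneous term."""
    z = X @ w
    return z + 0.5 * np.tanh(z)


def risk_and_grad(w):
    """Empirical risk under exponential loss and its gradient in w."""
    z = X @ w
    losses = np.exp(-y * (z + 0.5 * np.tanh(z)))
    dfdz = 1.0 + 0.5 * (1.0 - np.tanh(z) ** 2)  # derivative of f w.r.t. z
    grad = -((losses * y * dfdz)[:, None] * X).mean(axis=0)
    return losses.mean(), grad


def run_gd(w0, step=0.1, n_steps=5000):
    """Plain GD; records (risk, parameter norm, normalized margin) per step."""
    w = np.array(w0, dtype=float)
    history = []
    for _ in range(n_steps):
        risk, grad = risk_and_grad(w)
        norm = np.linalg.norm(w)
        margin = np.min(y * f(w, X)) / norm
        history.append((risk, norm, margin))
        w = w - step * grad
    return w, np.array(history)


# Start from a point that already classifies the data correctly.
w_final, history = run_gd([1.0, 0.0])

direction = w_final / np.linalg.norm(w_final)
# Max-margin direction of the linear part w.x on this dataset (by symmetry).
max_margin_dir = np.array([1.0, 1.0]) / np.sqrt(2.0)

print("risk:   first", history[0, 0], "last", history[-1, 0])
print("norm:   first", history[0, 1], "last", history[-1, 1])
print("margin: first", history[0, 2], "last", history[-1, 2])
print("cosine to max-margin direction:", float(direction @ max_margin_dir))
```

In this sketch the risk falls while the parameter norm keeps growing, and the normalized direction drifts toward the max-margin direction of the linear part, which plays the role of the homogenized model here. Directional convergence is slow (roughly logarithmic in the number of steps), so the cosine approaches 1 without reaching it in a short run.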
