Learning Discretized Neural Networks under Ricci Flow

Jun Chen; Hanwen Chen; Mengmeng Wang; Guang Dai; Ivor W. Tsang; Yong Liu

TL;DR

This work addresses gradient mismatch in discretized neural networks by recasting training as a metric perturbation problem on a Riemannian manifold and proposes a geometric remedy using Ricci flow. The authors construct a Linearly Nearly Euclidean (LNE) manifold via an information-geometric framework and show that Ricci flow exponentially dampens metric perturbations, enabling stable training of discretized networks. They then derive practical gradient computations on the evolving LNE manifold, introducing strong and weak approximations to invert the LNE metric and implement RF-DNNs with discrete Ricci flow. Empirical results on CIFAR and ImageNet demonstrate improved stability and accuracy over STE-based methods across bit-widths and architectures. Overall, the paper provides a theoretically grounded, geometry-driven approach to training low-precision DNNs with competitive performance and stability advantages.

Abstract

In this paper, we study Discretized Neural Networks (DNNs) composed of low-precision weights and activations, which suffer from either infinite or zero gradients due to the non-differentiable discrete function during training. Most training-based DNNs in such scenarios employ the standard Straight-Through Estimator (STE) to approximate the gradient w.r.t. discrete values. However, the use of STE introduces the problem of gradient mismatch, arising from perturbations in the approximated gradient. To address this problem, this paper reveals that this mismatch can be interpreted as a metric perturbation in a Riemannian manifold, viewed through the lens of duality theory. Building on information geometry, we construct the Linearly Nearly Euclidean (LNE) manifold for DNNs, providing a background for addressing perturbations. By introducing a partial differential equation on metrics, i.e., the Ricci flow, we establish the dynamical stability and convergence of the LNE metric with the $L^2$-norm perturbation. In contrast to previous perturbation theories with convergence rates in fractional powers, the metric perturbation under the Ricci flow exhibits exponential decay in the LNE manifold. Experimental results across various datasets demonstrate that our method achieves superior and more stable performance for DNNs compared to other representative training-based methods.
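The STE described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the uniform quantizer `quantize`, the toy squared-error loss, and the names `ste_step`, `step`, and `lr`. The forward pass evaluates the loss on $Q(\boldsymbol{w})$, and the backward pass copies $\partial L/\partial Q(\boldsymbol{w})$ onto the continuous weights, which is where the gradient mismatch can enter.

```python
def quantize(w, step=0.5):
    """Hypothetical uniform quantizer Q(w): snap each weight to a multiple of `step`."""
    return [step * round(x / step) for x in w]


def grad_wrt_quantized(q, target):
    """dL/dQ(w) for the toy loss L = 0.5 * sum((Q(w) - target)^2)."""
    return [a - b for a, b in zip(q, target)]


def ste_step(w, target, lr=0.1, step=0.5):
    """One STE update: the loss sees Q(w), but the gradient is applied to w unchanged."""
    q = quantize(w, step)                # forward pass on discretized weights
    g = grad_wrt_quantized(q, target)    # gradient w.r.t. the discrete weights
    # STE assumption: dL/dw := dL/dQ(w), i.e. the quantizer is treated as the identity.
    return [x - lr * gi for x, gi in zip(w, g)]


w = [0.3, -0.8]
target = [1.0, -1.0]
print("Q(w)          =", quantize(w))
print("w after 1 step =", ste_step(w, target))
```

Note that the second weight receives a zero update because its quantized value already matches the target, even though the continuous weight itself does not; the copied gradient carries no information about the gap between $\boldsymbol{w}$ and $Q(\boldsymbol{w})$.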

Paper Structure (47 sections, 17 theorems, 95 equations, 7 figures, 5 tables, 2 algorithms)

This paper contains 47 sections, 17 theorems, 95 equations, 7 figures, 5 tables, 2 algorithms.

Introduction
Contributions
Overall Organization
Motivation and Formulation
Background
Motivation
Ricci Flow
Literature
Neural Networks in LNE Manifolds
Neural Network Manifold
Euclidean Space and Divergence
LNE Manifold and Divergence
Convex Function and Bregman Divergence
LNE Divergence and Gradient
Evolution of LNE Manifolds under Ricci Flow
...and 32 more sections

Key Result

Corollary 3

[sheridan2006hamilton] The Ricci flow is strongly parabolic if there exists $\delta > 0$ such that for all covectors $\varphi \neq 0$ and all (symmetric) $h_{ij}$ … where $h^{ij}$ is the inverse of $h_{ij}$.

Footnote: The Riemannian metric $g_{ij}$ is always symmetric based on Definition 8. Hence, $h_{ij}=\frac{\partial}{\partial t} g_{ij}(t)$ is required to be symmetric.
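For orientation, the evolution equation that the corollary refers to can be written in its standard form (Hamilton's Ricci flow). The notation below follows the symbols used on this page; that the paper uses exactly this normalization is an assumption.

$$\frac{\partial}{\partial t} g_{ij}(t) = -2\,\operatorname{Ric}_{ij}\big(g(t)\big)$$

Under this reading, the tensor $h_{ij}=\frac{\partial}{\partial t} g_{ij}(t)$ in the corollary is the instantaneous change of the metric, and it inherits the symmetry of $g_{ij}$.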

Figures (7)

Figure 1: Comparison of STE and our method. Arrows denote gradients and points denote weights. A point lying on a grid point indicates that the weight is discretized at that step. In the forward pass, the continuous weight $\boldsymbol{w}$ is mapped to a discrete weight $Q(\boldsymbol{w})$ via a discrete function. In the backward pass, the gradient is propagated from $\partial L /\partial Q(\boldsymbol{w})$ to $\partial L /\partial \boldsymbol{w}$. (a) The STE simply copies the gradient, i.e., $\partial L /\partial \boldsymbol{w}=\partial L /\partial Q(\boldsymbol{w})$. (b) Our method matches the gradient by introducing the proper metric $g_{\boldsymbol{w}}$, i.e., $\partial L /\partial \boldsymbol{w}=g^{-1}_{\boldsymbol{w}}\partial L /\partial Q(\boldsymbol{w})$ in a Riemannian manifold.
Figure 2: The overview of the theoretical ideas.
Figure 3: The divergence $D[\boldsymbol{\xi}:\boldsymbol{\xi}']$ is viewed as the distance between the convex function $\Phi(\boldsymbol{\xi})$ and its tangent hyperplane $z$, where the supporting hyperplane with normal vector $\boldsymbol{n}=\nabla \Phi(\boldsymbol{\xi}')$ at the point $\boldsymbol{\xi}'$ is defined.
Figure 4: The flow chart of strong approximation of $g^{-1}(\boldsymbol{w})$. The new entries $\tilde{P}$ and $\tilde{A}$ generated by the neural network constitute a matrix $\boldsymbol{G}$, which is multiplied by the metric $g(\boldsymbol{w})$. As the loss function (defined by the paper's loss equation) decreases, the matrix $\boldsymbol{G}$ serves to approximate the inverse of the metric $g(\boldsymbol{w})$.
Figure 5: Upon feeding the original image into the neural network and performing a forward and backward pass on the linear layer to update the weights $\boldsymbol{w}$, we construct the metric structure $g(\boldsymbol{w})$ based on Section 5.1. Furthermore, we subject the original image to four distinct small translation transformations ($k_1$, $k_2$, $j_1$, and $j_2$) before inputting them into the neural network. By sequentially performing forward and backward passes, we obtain four metric structures ($g|_{k_1}$, $g|_{k_2}$, $g|_{j_1}$, and $g|_{j_2}$) corresponding to these translations. The combination of these metrics allows us to characterize the Ricci curvature $\operatorname{Ric}(g)$.
...and 2 more figures
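The two backward rules in the Figure 1 caption can be compared with a small sketch. The 2x2 metric used here (the identity plus a small symmetric perturbation `eps`) is a hypothetical stand-in chosen for illustration; it is not the paper's LNE construction, and the helper names `inverse_2x2`, `matvec`, `ste_gradient`, and `riemannian_gradient` are likewise assumptions.

```python
def inverse_2x2(g):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = g
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("metric is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def ste_gradient(grad_q):
    """Rule (a): copy dL/dQ(w) unchanged onto w."""
    return list(grad_q)


def riemannian_gradient(metric, grad_q):
    """Rule (b): dL/dw = g^{-1} dL/dQ(w) for a given metric g."""
    return matvec(inverse_2x2(metric), grad_q)


# Hypothetical nearly Euclidean metric: identity plus a small symmetric term.
eps = 0.1
metric = [[1.0, eps], [eps, 1.0]]
grad_q = [-0.5, 0.2]

print("STE gradient       :", ste_gradient(grad_q))
print("Riemannian gradient:", riemannian_gradient(metric, grad_q))
```

With the identity metric the two rules coincide, so in this sketch the STE is the special case in which the manifold is treated as exactly Euclidean.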

Theorems & Definitions (32)

Definition 1
Definition 2
Corollary 3
Theorem 4
Lemma 5
Remark 6
Definition 7
Definition 8
Remark 9
Theorem 10
...and 22 more
