Covariant Gradient Descent
Dmitry Guskov, Vitaly Vanchurin
TL;DR
This work introduces covariant gradient descent (CGD), a manifestly covariant formulation of optimization that remains consistent under arbitrary coordinate changes and in curved trainable spaces. The dynamics are derived from a covariant force $F_\nu(t)$ and a covariant metric $g_{\mu\nu}(t)$, both built from the first and second moments of the gradients. In this framework SGD, RMSProp, Adam, and AdaBelief appear as special cases, and using the full covariance of the gradients generalizes them. The moments are estimated with exponential moving averages, which preserves linear per-iteration complexity. Empirical results on the Rosenbrock function and a multiplication task show that diagonal CGD, and especially full CGD, can outperform traditional optimizers, with the full-covariance approach concentrating updates in a progressively lower-dimensional subspace. The work also discusses computational challenges and future directions for exploiting off-diagonal gradient information.
Abstract
We present a manifestly covariant formulation of the gradient descent method, ensuring consistency across arbitrary coordinate systems and general curved trainable spaces. The optimization dynamics is defined using a covariant force vector and a covariant metric tensor, both computed from the first and second statistical moments of the gradients. These moments are estimated through time-averaging with an exponential weight function, which preserves linear computational complexity. We show that commonly used optimization methods such as RMSProp, Adam and AdaBelief correspond to special limits of the covariant gradient descent (CGD) and demonstrate how these methods can be further generalized and improved.
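To make the "special limits" statement concrete, the following is a minimal NumPy sketch of the familiar diagonal case, in which the metric reduces to a per-parameter scale computed from exponentially averaged gradient moments. It is an illustration under stated assumptions, not the paper's definition of CGD: the function names `diagonal_step` and `minimize_quadratic`, the `mode` switch, the hyperparameter defaults, and the omission of bias correction are all choices made here for brevity.

```python
import numpy as np

def diagonal_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, mode="adam"):
    """One update of the form theta <- theta - lr * force / metric_diag.

    Assumption: the metric is diagonal and equal to sqrt(second moment) + eps,
    as in the textbook forms of RMSProp, Adam, and AdaBelief (no bias correction).
    """
    m, v = state
    # First moment: exponential moving average of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    if mode == "adabelief":
        # Centered second moment: average of squared deviation from the mean.
        v = beta2 * v + (1.0 - beta2) * (grad - m) ** 2
    else:
        # Uncentered second moment: average of the squared gradient.
        v = beta2 * v + (1.0 - beta2) * grad ** 2
    # RMSProp uses the raw gradient as the force; Adam and AdaBelief use m.
    force = grad if mode == "rmsprop" else m
    metric_diag = np.sqrt(v) + eps
    return theta - lr * force / metric_diag, (m, v)

def minimize_quadratic(mode, steps=2000):
    """Run the update on f(theta) = 0.5 * ||theta||^2 and return the final loss."""
    theta = np.array([1.0, -2.0])
    state = (np.zeros_like(theta), np.zeros_like(theta))
    for _ in range(steps):
        grad = theta  # gradient of 0.5 * ||theta||^2
        theta, state = diagonal_step(theta, grad, state, mode=mode)
    return 0.5 * float(theta @ theta)
```

Calling `minimize_quadratic("rmsprop")`, `minimize_quadratic("adam")`, or `minimize_quadratic("adabelief")` runs the same loop with a different choice of force and second moment. Each state vector has the same size as `theta`, so the cost per iteration stays linear in the number of parameters. A full (non-diagonal) metric would replace the elementwise division with a linear solve against a matrix built from the gradient covariance.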
