Group Entropies and Mirror Duality: A Class of Flexible Mirror Descent Updates for Machine Learning

Andrzej Cichocki; Piergiulio Tempesta

Abstract

We introduce a comprehensive theoretical and algorithmic framework that bridges formal group theory and group entropies with modern machine learning, paving the way for an infinite, flexible family of Mirror Descent (MD) optimization algorithms. Our approach exploits the rich structure of group entropies, which are generalized entropic functionals governed by group composition laws, encompassing and significantly extending all trace-form entropies such as the Shannon, Tsallis, and Kaniadakis families. By leveraging group-theoretical mirror maps (or link functions) in MD, expressed via multi-parametric generalized logarithms and their inverses (group exponentials), we achieve highly flexible and adaptable MD updates that can be tailored to diverse data geometries and statistical distributions. To this end, we introduce the notion of mirror duality, which allows us to interchange group-theoretical link functions with their inverses, subject to specific learning rate constraints. Tuning or learning the hyperparameters of the group logarithms enables us to adapt the model to the statistical properties of the training distribution while ensuring desirable convergence characteristics via fine-tuning. This generality not only provides greater flexibility and improved convergence properties, but also opens new perspectives for applications in machine learning and deep learning by expanding the design of regularizers and natural gradient algorithms. We extensively evaluate the validity, robustness, and performance of the proposed updates on large-scale, simplex-constrained quadratic programming problems.
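To make the idea of a generalized-logarithm link function concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the one-parameter Tsallis $q$-logarithm as the mirror map (the abstract names the Tsallis family, but the paper's multi-parametric group logarithms are not reproduced here), and it assumes the update has the form "map to the dual space, take a gradient step, map back, renormalize onto the simplex". The names `log_q`, `exp_q`, and `md_step` are hypothetical.

```python
import math


def log_q(x, q):
    """Tsallis q-logarithm; reduces to the natural log at q = 1."""
    if q == 1.0:
        return math.log(x)  # requires x > 0
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def exp_q(y, q):
    """Tsallis q-exponential, the inverse of log_q on its range.

    Assumes 0 < q <= 1. For q < 1 the base is clipped at zero, so the
    output can be exactly 0.0 (unlike the ordinary exponential).
    """
    if q == 1.0:
        return math.exp(y)
    base = 1.0 + (1.0 - q) * y
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def md_step(w, grad, eta, q):
    """One assumed MD step on the simplex with link function log_q."""
    z = [exp_q(log_q(wi, q) - eta * gi, q) for wi, gi in zip(w, grad)]
    total = sum(z)
    if total == 0.0:
        raise ValueError("all weights were clipped to zero; reduce eta")
    return [zi / total for zi in z]  # renormalize so the weights sum to 1


w0 = [1.0 / 3.0] * 3
cost = [1.0, 2.0, 3.0]  # gradient of the linear objective <cost, w>
w_eg = md_step(w0, cost, eta=0.5, q=1.0)  # q = 1: standard EG update
w_q = md_step(w0, cost, eta=0.5, q=0.5)   # q = 0.5: clipping can zero a weight
```

In this toy run the $q=1$ step keeps every weight strictly positive, while the $q=0.5$ step sets the weight of the costliest coordinate to exactly zero because the $q$-exponential clips negative arguments.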

Paper Structure (40 sections, 8 theorems, 84 equations, 5 figures, 5 tables)

Introduction
Preliminaries: Groups and Entropies
Group Entropies: A Brief Review
Formal Groups: Main Definitions
Generalized Logarithms and Exponentials from Group Laws
Group Logarithms: Examples
The Super--Exponential Case
On the Relevance of Group Entropies in Machine Learning and Information Theory
Related work on Mirror Descent
Notations and Assumptions
The Bregman Divergence
The Optimization Problem
Mirror Descent
The Natural Gradient Descent (NGD)
Basic Mirror Descent Updates: Standard Exponentiated Gradient (EG)
...and 25 more sections

Key Result

Lemma 1

The relations hold, where $\chi=G_1\circ G_2^{-1}\circ\cdots\circ G_{2n-1}\circ G_{2n}^{-1}$.

Figures (5)

Figure 1: Relative primal gap (left column) and relative Frank--Wolfe duality gap (right column) versus iteration for EG, GEG, and DMD. Benchmarks are matrix-free SCQP instances with spectral normalization ($\|\boldsymbol{Q}\|_2=1$) and condition number $\kappa=1\,000$. The sparsity pattern is planted with $K=0.1n$ nonzero entries and the iteration budget is $T_{\max}=200$. DMD (green) descends rapidly, reaching gaps orders of magnitude below those of EG (blue), which plateaus due to its inability to drive inactive weights to zero. Shaded regions indicate pointwise 95% confidence intervals across runs (mean $\pm 1.96\,\mathrm{SE}$).
Figure 2: Support recovery: Jaccard index (IoU) versus iteration for varying sparsity levels $K\in\{100,300,500,700\}$ with $n=1\,000$, $\kappa=1\,000$, $\mathrm{SNR}=20$ dB. DMD behaves as a step-function classifier, attaining $\mathrm{IoU}=1.0$ within 2--15 iterations. EG consistently falls short of $\mathrm{IoU}\ge 0.9$ because it assigns small nonzero probabilities to inactive elements instead of eliminating them.
Figure 3: Trajectory of support recovery (IoU) versus the relative FW duality gap (log scale). The vertical cliff for DMD and GEG indicates that these algorithms identify the correct sparsity structure ($\mathrm{IoU}=1.0$) while the duality gap is still relatively large (${\sim}\,10^{-3}$), well before reaching high numerical precision.
Figure 4: Noise robustness: final relative FW duality gap after $T_{\max}=100$ iterations versus $\mathrm{SNR}$ (dB), averaged over 50 independent noise realizations with dispersion bands. Setup: $n=2\,000$, $K=200$, $\kappa=1\,000$. DMD and GEG sustain low duality gaps for $\mathrm{SNR}>5$ dB, confirming the noise-gating property of the $q$-exponential map.
Figure 5: Robustness to ill-conditioning ($n=2\,000$, $\mathrm{SNR}=20$ dB, 100 iterations). (a) Iterations to reach the $10^{-3}$ FW duality threshold versus condition number $\kappa$ (up to $10^7$). (b) Support recovery delay versus $\kappa$. DMD exhibits remarkable insensitivity to the condition number in both metrics.
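The two metrics named in the captions above can be sketched as follows. This is an illustrative sketch under stated assumptions: the Frank--Wolfe duality gap is written in its standard form for the probability simplex, $\langle\nabla f(w), w\rangle - \min_i \nabla f(w)_i$; the normalization that makes it "relative" is not given in this summary, so it is left out. The support threshold `tol` and the function names are hypothetical.

```python
def fw_gap_simplex(w, grad):
    """Frank-Wolfe duality gap on the probability simplex.

    The linear minimization oracle over the simplex picks the vertex with
    the smallest gradient entry, so the gap is <grad, w> - min(grad).
    """
    return sum(wi * gi for wi, gi in zip(w, grad)) - min(grad)


def support(w, tol=0.0):
    """Indices whose weight exceeds tol (tol is an assumed threshold)."""
    return {i for i, wi in enumerate(w) if wi > tol}


def jaccard(a, b):
    """Jaccard index (IoU) of two index sets; defined as 1.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy check: an iterate that puts a small weight on an inactive coordinate.
true_support = {0, 2}
w = [0.6, 0.05, 0.35]
iou_loose = jaccard(support(w), true_support)            # 2 / 3
iou_tight = jaccard(support(w, tol=0.1), true_support)   # 1.0
gap = fw_gap_simplex([0.5, 0.5], [1.0, 3.0])             # 2.0 - 1.0 = 1.0
```

The toy check mirrors the point made in the Figure 2 caption: a small but nonzero weight on an inactive coordinate lowers the IoU unless it is thresholded away or driven to exactly zero by the update.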

Theorems & Definitions (35)

Definition 1
Definition 2
Definition 3
Definition 4
Remark 1
Remark 2
Definition 5
Remark 3
Definition 6
Remark 4
...and 25 more
