Convex optimization over a probability simplex

James Chok; Geoffrey M. Vasil

TL;DR

This work introduces the Cauchy-Simplex (CS), a gradient-flow-based method for convex optimization over the probability simplex that preserves both positivity and the unit-sum constraint. By lifting the problem to a latent space and deriving a continuous-time gradient flow, the authors obtain a discretization that maintains the simplex constraints without costly projections, with a proven $O(1/T)$ convergence rate in continuous time and sublinear rates for the discrete schemes. The method unifies ideas from projected and exponentiated gradient methods, and its convergence analysis relies on information-theoretic quantities such as relative entropy. An extension to orthogonal matrix constraints via Cayley transforms is provided, along with applications to convex hull projection, optimal question weighting, and online learning (prediction with expert advice and universal portfolios), supported by empirical results. The CS framework is a simple, scalable, and theoretically grounded alternative for high-dimensional simplex-constrained optimization.

Abstract

We propose a new iteration scheme, the Cauchy-Simplex, to optimize convex problems over the probability simplex $\{w\in\mathbb{R}^n\ |\ \sum_i w_i=1\ \textrm{and}\ w_i\geq0\}$. Specifically, we map the simplex to the positive quadrant of a unit sphere, envisage gradient descent in latent variables, and map the result back in a way that only depends on the simplex variable. Moreover, proving rigorous convergence results in this formulation leads inherently to tools from information theory (e.g., cross-entropy and KL divergence). Each iteration of the Cauchy-Simplex consists of simple operations, making it well-suited for high-dimensional problems. In continuous time, we prove that $f(w^T)-f(w^*) = O(1/T)$ for differentiable real-valued convex functions, where $T$ is the number of time steps and $w^*$ is the optimal solution. Numerical experiments on projection onto convex hulls show faster convergence than similar algorithms. Finally, we apply our algorithm to online learning problems and prove the convergence of the average regret for (1) Prediction with expert advice and (2) Universal Portfolios.
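To make the idea of an iteration that never leaves the simplex concrete, here is a minimal Python sketch. It assumes the discretized update has the multiplicative form $w \leftarrow w - \eta\, w \odot (\nabla f - w\cdot\nabla f)$; that form, the function name `cauchy_simplex_step`, the parameters `eta` and `safety`, and the toy objective are all assumptions made for illustration and are not spelled out in the abstract above.

```python
def cauchy_simplex_step(w, grad, eta=1.0, safety=0.9):
    """One assumed Cauchy-Simplex-style step: w_i <- w_i * (1 - step * (g_i - w.g)).

    The weighted mean w.g is subtracted from the gradient, so the update
    sums to zero and the unit-sum constraint is preserved without projection.
    """
    mean_grad = sum(wi * gi for wi, gi in zip(w, grad))
    # Largest centred gradient on the support; it limits the step size
    # that keeps every coordinate non-negative.
    gap = max(gi - mean_grad for wi, gi in zip(w, grad) if wi > 0)
    if gap <= 0.0:
        # Gradient is constant on the support: nothing to update.
        return list(w)
    # Cap the step strictly below 1 / gap so positive weights stay positive.
    step = min(eta, safety / gap)
    return [wi * (1.0 - step * (gi - mean_grad)) for wi, gi in zip(w, grad)]


# Toy problem (hypothetical): minimise f(w) = 0.5 * ||w - target||^2,
# where target already lies inside the simplex.
target = [0.2, 0.3, 0.5]
w = [1.0 / 3.0] * 3
for _ in range(200):
    grad = [wi - ti for wi, ti in zip(w, target)]
    w = cauchy_simplex_step(w, grad)

print(w, sum(w))
```

Capping the step at a fraction of `1 / gap` is one simple way to keep every weight positive; a line search could be used in its place.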

Paper Structure (24 sections, 9 theorems, 121 equations, 2 figures, 3 tables, 1 algorithm)

Introduction
Previous Works
The Main Algorithm
Motivating Derivation
On the Learning Rate
Connections to Previous Methods
The Algorithm
Convergence Proof
Extension: Optimization over Orthogonal Matrices
Applications
Projection onto the Convex Hull
Optimal Question Weighting
Prediction from Expert Advice
Universal Portfolio
Conclusion
...and 9 more sections

Key Result

Theorem 6

Let $f$ be a real-valued convex function with Lipschitz continuous gradient that is continuously differentiable w.r.t. $w^t$, and let $w^t$ be continuously differentiable w.r.t. $t$. For the Cauchy-Simplex gradient flow (eq:cauchy_simplex_dw_dt) with initial condition $w^0\in\textrm{relint}(\Delta^n)$, $f(w^t)$ is a stri

Figures (2)

Figure 1: Number of steps and time required for PFW, EGD, and CS to project 50 randomly sampled points onto the $d$-hypercube. The bars indicate the minimum and maximum values.
Figure 2: Optimal question weighting for (randomly generated) exam scores with 200 students and 75 questions. The setup follows the experimental details in the Optimal Question Weighting section. The kernel density estimate uses a truncated unit normal distribution with $\varepsilon=0.05$, and the target distribution is a truncated normal distribution with a mean of 0.5 and a standard deviation of 0.1. We take $w^0=(0.01,\ldots, 0.01)$, and the Cauchy-Simplex is applied, with each step using a backtracking line search. The resulting weighted histograms and kernel density estimates are shown. The distribution of the weighted marks is shown on the top row, and its QQ plots against a normal distribution with a mean of 0.5 and a standard deviation of 0.1 are shown on the bottom row. At iterations 0, 5, and 20, the weighted scores have means of 0.499, 0.514, and 0.501, with standard deviations of 0.128, 0.124, and 0.109, respectively.
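The convex hull projection experiment of Figure 1 can be phrased as a simplex-constrained problem: given vertices $v_1,\ldots,v_n$ and a point $y$, minimise $\tfrac12\|\sum_i w_i v_i - y\|^2$ over weights $w$ in the simplex. The sketch below is a small Python illustration under stated assumptions, not the authors' experimental code: the multiplicative update form, the helper name `project_onto_hull`, the fixed step `eta`, the `safety` cap, and the unit-square example are all hypothetical choices.

```python
def project_onto_hull(vertices, y, steps=200, eta=0.3, safety=0.9):
    """Approximate the projection of y onto conv(vertices).

    Uses the assumed update w_i <- w_i * (1 - step * (g_i - w.g)), where
    g is the gradient of 0.5 * ||sum_i w_i v_i - y||^2 with respect to w.
    """
    n, dim = len(vertices), len(y)
    w = [1.0 / n] * n  # start at the centre of the simplex

    def combine(weights):
        # Convex combination of the vertices under the current weights.
        return [sum(wi * v[k] for wi, v in zip(weights, vertices)) for k in range(dim)]

    for _ in range(steps):
        residual = [p - t for p, t in zip(combine(w), y)]
        # Gradient component i is the inner product <v_i, residual>.
        grad = [sum(vk * rk for vk, rk in zip(v, residual)) for v in vertices]
        mean_grad = sum(wi * gi for wi, gi in zip(w, grad))
        gap = max(gi - mean_grad for gi in grad)
        if gap <= 0.0:
            break  # gradient is constant across vertices: stationary
        step = min(eta, safety / gap)  # cap keeps every weight positive
        w = [wi * (1.0 - step * (gi - mean_grad)) for wi, gi in zip(w, grad)]
    return w, combine(w)


# Unit square (the 2-hypercube) and a point lying outside it.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
w, point = project_onto_hull(square, (1.5, 0.5))
print(point)  # should be close to (1.0, 0.5)
```

The fixed step `eta=0.3` is chosen conservatively for this small example; the Figure 2 caption mentions a backtracking line search, which would replace it in a more careful implementation.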

Theorems & Definitions (15)

Remark 1
Claim 2
Remark 3
Claim 4
Remark 5
Theorem 6
Theorem 7
Theorem 8
Theorem 9: Convergence of Linear Scheme
Lemma 10: Asymptotic Convergence of Linear Scheme
...and 5 more
