On the existence of the maximum likelihood estimate and convergence rate under gradient descent for multi-class logistic regression

Dwight Nwaigwe, Marek Rychlik

TL;DR

This work proves the existence of the maximum likelihood estimate for multiclass logistic regression under label smoothing without requiring data separability, by leveraging a shift-invariant, locally strongly convex loss on a subspace. It then derives a constructive convergence-rate bound for gradient descent by analyzing the Hessian’s spectrum, expressing the Hessian in a Kronecker-product form and bounding its eigenvalues. The results provide explicit, data-dependent bounds on the Hessian and condition number, both for the full-rank case $N=D$ and the overdetermined case $N>D$, enabling practical contraction-rate guarantees. The analysis hinges on an operator-theoretic framework, translating neural-network-like optimization questions into linear-algebraic bounds tied to data geometry and class-probability structure, with clear implications for convergence behavior in large-scale multiclass settings.

Abstract

We revisit the problem of the existence of the maximum likelihood estimate for multi-class logistic regression. We show that one method of ensuring its existence is by assigning positive probability to every class in the sample dataset. The notion of data separability is not needed, in contrast to the classical setup of multi-class logistic regression, in which each data sample belongs to one class. We also provide a general and constructive estimate of the convergence rate to the maximum likelihood estimate when gradient descent is used as the optimizer. Our estimate involves bounding the condition number of the Hessian of the maximum likelihood function. The approaches used in this article rely on a simple operator-theoretic framework.
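The two claims above can be illustrated with a minimal NumPy sketch. It is not the authors' code and rests on several assumptions that the summary does not state: the loss is the standard soft-label cross-entropy, the data matrix `X` has shape D×N with samples as columns, the weight matrix `W` has shape C×D, and positivity of the targets is obtained by uniform label smoothing with a hypothetical parameter `eps`. The Hessian is assembled as a sum of Kronecker products, and the condition number is taken over its nonzero eigenvalues, since shifting every row of `W` by the same vector leaves the loss unchanged.

```python
import numpy as np

def softmax_cols(Z):
    """Column-wise softmax, shifted for numerical stability."""
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def smooth_labels(labels, C, eps):
    """One-hot columns with mass eps spread uniformly, so every entry is > 0."""
    T = np.full((C, labels.size), eps / C)
    T[labels, np.arange(labels.size)] += 1.0 - eps
    return T

def loss(W, X, T):
    """Soft-label cross-entropy (assumed form of the loss)."""
    Z = W @ X
    Z = Z - Z.max(axis=0, keepdims=True)
    log_p = Z - np.log(np.exp(Z).sum(axis=0, keepdims=True))
    return -(T * log_p).sum()

def grad(W, X, T):
    """Gradient of the loss with respect to W, shape (C, D)."""
    return (softmax_cols(W @ X) - T) @ X.T

def hessian(W, X):
    """Hessian w.r.t. row-major vec(W): sum_n (diag(p_n) - p_n p_n^T) kron (x_n x_n^T)."""
    P = softmax_cols(W @ X)
    C, D = W.shape
    H = np.zeros((C * D, C * D))
    for p, x in zip(P.T, X.T):
        H += np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))
    return H

rng = np.random.default_rng(0)
C, D, N = 3, 2, 30
X = rng.normal(size=(D, N))
# Labels produced by a linear classifier, so the hard-label problem is separable.
labels = np.argmax(rng.normal(size=(C, D)) @ X, axis=0)
T = smooth_labels(labels, C, eps=0.1)

# Each block diag(p) - p p^T has eigenvalues <= 1/2 (Gershgorin), so
# lip = 0.5 * lambda_max(X X^T) bounds the Hessian everywhere.
lip = 0.5 * np.linalg.eigvalsh(X @ X.T)[-1]
W = np.zeros((C, D))
losses = [loss(W, X, T)]
for _ in range(2000):
    W = W - grad(W, X, T) / lip
    losses.append(loss(W, X, T))

H = hessian(W, X)
eig = np.linalg.eigvalsh(H)
nonzero = eig[eig > 1e-8 * eig[-1]]   # drop the D shift-invariance directions
kappa = nonzero[-1] / nonzero[0]
print("final loss:", losses[-1], "gradient norm:", np.linalg.norm(grad(W, X, T)))
print("condition number on the complement of the null space:", kappa)
```

Starting from `W = 0` keeps the column sums of `W` at zero throughout, because every gradient has zero column sums; the iterates therefore stay in the subspace on which the loss can be strongly convex. The fixed step `1 / lip` is a conservative global choice for this sketch, not the data-dependent bound derived in the paper.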

Paper Structure

This paper contains 8 sections, 17 theorems, 86 equations, 1 figure.

Key Result

Lemma 1

Assume $\mathbf{T}>0$, $N=D$, and that $\mathbf{X}$ is invertible. Then a minimum of $L$ exists and every minimum $\widetilde{\mathbf{W}}$ is given by where $\mathbf{R} = \ln(\mathbf{T})$ (elementwise logarithm) and $\mathbf{c}\in\mathbb{R}^D$ is arbitrary. Exactly one of the minima belongs to $Z$ and is and is also the matrix obtained from $\mathbf{R}\mathbf{X}^{-1}$ obtained by subtracting fro
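The displayed formulas of Lemma 1 did not survive in this listing, and the sketch below does not attempt to restore them. It only checks numerically a fact that follows from the stated ingredients under assumptions of ours: the loss is the soft-label cross-entropy, $\mathbf{X}$ is a $D\times D$ invertible matrix with samples as columns, the columns of $\mathbf{T}$ sum to one, and $Z$ is taken to be the subspace of weight matrices whose columns sum to zero. Under these assumptions, $\mathbf{R}\mathbf{X}^{-1}$ reproduces $\mathbf{T}$ exactly through the softmax, so the gradient vanishes there, and adding the same row vector $\mathbf{c}^\top$ to every row changes nothing.

```python
import numpy as np

def softmax_cols(Z):
    """Column-wise softmax, shifted for numerical stability."""
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def loss(W, X, T):
    """Soft-label cross-entropy (assumed form of the loss L)."""
    Z = W @ X
    Z = Z - Z.max(axis=0, keepdims=True)
    return -(T * (Z - np.log(np.exp(Z).sum(axis=0, keepdims=True)))).sum()

rng = np.random.default_rng(1)
C, D = 4, 3                                   # N = D in this case
X = rng.normal(size=(D, D)) + 3.0 * np.eye(D) # square data matrix, invertible here
T = rng.uniform(0.1, 1.0, size=(C, D))
T /= T.sum(axis=0, keepdims=True)             # positive columns summing to one

R = np.log(T)                                 # elementwise logarithm
W0 = np.linalg.solve(X.T, R.T).T              # equals R @ inv(X)

# Stationarity: softmax(W0 X) = softmax(ln T) = T, so the gradient is zero.
G = (softmax_cols(W0 @ X) - T) @ X.T

# Shift invariance: adding c^T to every row leaves the loss unchanged.
c = rng.normal(size=D)
W_shift = W0 + np.outer(np.ones(C), c)

# Representative with zero column sums: subtract each column's mean.
W_centered = W0 - W0.mean(axis=0, keepdims=True)

print("gradient norm at W0:", np.linalg.norm(G))
print("loss at W0, shifted, centered:",
      loss(W0, X, T), loss(W_shift, X, T), loss(W_centered, X, T))
```

Using `np.linalg.solve` on the transposed system avoids forming the inverse explicitly; the centering step is the choice of $\mathbf{c}$ that removes the column means.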

Figures (1)

  • Figure 1: Each panel corresponds to 2000 realizations, with the matrix $\mathbf{K}$ used to compute $\mathbf{A}$ given by \ref{nonisometric-K}. The eigenvalues of $\mathbf{A}$ are computed for $C=3,6,9,12,15,18$. Left: frequencies of the ratio $\frac{\lambda_{C-1}\left(\mathbf{A}\right)}{y_{min}}$. Right: frequencies of the ratio $\frac{\lambda_{C-1}\left(\mathbf{A}\right)}{y_{min}}$.

Theorems & Definitions (35)

  • Definition 1: Positivity of a Matrix
  • Definition 2: Nullspace and Range of a Linear Operator
  • Lemma 1
  • proof
  • Corollary 1
  • proof
  • Theorem 3.1: Existence of minimum, $\operatorname{rank}\left(\mathbf{X}\right)=D$
  • proof
  • Theorem 3.2: Existence of minimum, $\operatorname{rank}\left(\mathbf{X}\right)$ arbitrary
  • proof
  • ...and 25 more