On The Statistical Representation Properties Of The Perturb-Softmax And The Perturb-Argmax Probability Distributions

Hedda Cohen Indelman; Tamir Hazan

TL;DR

This work analyzes the statistical representation properties of the Perturb-Softmax and Perturb-Argmax distributions, establishing when these perturbation-based models yield complete and minimal representations of probability distributions. By framing softmax and argmax as gradients or sub-gradients of convex functions (log-sum-exp and max, respectively), the authors use convex analysis and duality to derive conditions under which the parameter spaces are complete and minimal, including extensions to Gaussian perturbations (Gaussian-Softmax and Gaussian-Argmax). The theoretical results are complemented by experiments showing that Gaussian-Softmax can converge faster and approximate discrete distributions better than Gumbel-Softmax, and that the perturbation type critically influences identifiability and minimality. This framework provides a rigorous foundation for selecting perturbations in discrete modeling and offers practical benefits for both generative and discriminative learning tasks.
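The statement that softmax is the gradient of log-sum-exp can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`log_sum_exp`, `softmax`, `numerical_gradient`) and the test vector `theta` are illustrative assumptions, not taken from the paper.

```python
import math

def log_sum_exp(theta):
    """Numerically stable log(sum_i exp(theta_i))."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))

def softmax(theta):
    """Softmax probabilities, using the same max-shift for stability."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Illustrative parameter vector (an assumption, not from the paper).
theta = [0.5, -1.0, 2.0]
probs = softmax(theta)
grad = numerical_gradient(log_sum_exp, theta)

# Largest coordinate-wise gap between softmax and the numerical gradient.
max_gap = max(abs(p - g) for p, g in zip(probs, grad))
print(max_gap)
```

The gap should be at the level of finite-difference error, which is consistent with softmax being the gradient of the convex log-sum-exp function.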

Abstract

The Gumbel-Softmax probability distribution allows learning discrete tokens in generative learning, while the Gumbel-Argmax probability distribution is useful in learning discrete structures in discriminative learning. Despite the efforts invested in optimizing these probability models, their statistical properties are under-explored. In this work, we investigate their representation properties and determine for which families of parameters these probability distributions are complete, i.e., can represent any probability distribution, and minimal, i.e., can represent a probability distribution uniquely. We rely on convexity and differentiability to determine these statistical conditions and extend this framework to general probability models, such as Gaussian-Softmax and Gaussian-Argmax. We experimentally validate the qualities of these extensions, which enjoy a faster convergence rate. We conclude the analysis by identifying two sets of parameters that satisfy these assumptions and thus admit a complete and minimal representation. Our contribution is theoretical with supporting practical evaluation.
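To make the notions of completeness and minimality concrete, the sketch below illustrates two well-known facts rather than the paper's own experiments: (i) the Gumbel-Argmax distribution with standard Gumbel perturbations matches the softmax of the parameters, and (ii) softmax over all of $\mathbb{R}^d$ is not minimal, because adding a constant to every coordinate of $\theta$ leaves the distribution unchanged. The function names, sample count, and seed are assumptions chosen for illustration.

```python
import math
import random

def softmax(theta):
    """Softmax probabilities with a max-shift for numerical stability."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def sample_gumbel(rng):
    """Standard Gumbel sample via the inverse CDF: -log(-log(U)), U in (0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0; log(0) is undefined
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_argmax_frequencies(theta, n_samples, rng):
    """Empirical distribution of argmax_i (theta_i + gamma_i), gamma_i ~ Gumbel(0, 1)."""
    counts = [0] * len(theta)
    for _ in range(n_samples):
        perturbed = [t + sample_gumbel(rng) for t in theta]
        counts[perturbed.index(max(perturbed))] += 1
    return [c / n_samples for c in counts]

rng = random.Random(0)          # fixed seed, chosen arbitrarily
theta = [1.0, 0.0, -1.0]        # illustrative parameters
target = softmax(theta)
freqs = gumbel_argmax_frequencies(theta, 20000, rng)

# Shifting every coordinate by the same constant gives the same distribution,
# so theta is not uniquely identified unless the parameter set is restricted.
shifted = softmax([t + 3.0 for t in theta])

print(target)
print(freqs)
print(shifted)
```

The empirical frequencies should approach `target` as the sample count grows, while `shifted` equals `target` up to floating-point error, which is why a minimal representation requires restricting the parameter set.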

Paper Structure (29 sections, 12 theorems, 52 equations, 10 figures, 1 table)

Introduction
Related work
Background
Completeness and minimality of the Softmax operation
Gumbel-Softmax and Gumbel-Argmax probability distributions
Differentiability properties of convex functions
Perturb-Softmax probability distributions
Perturb-Argmax probability distributions
Non-minimal representation for bounded perturbations
Discrete perturbations and identifiability
Experiments
Approximating discrete distributions
Variational inference
APPENDIX
Related work
...and 14 more sections

Key Result

Theorem 4.1

Let $\Theta \subseteq \mathbb{R}^d$ be a convex set and let $\gamma = (\gamma_1, \dots, \gamma_d)$ be a vector of random variables whose cumulative distribution decays to zero as $\gamma$ approaches $\pm \infty$. Let $h_i(\theta) = \theta_i - \max_{j \ne i} \theta_j$ be a continuous function over $\Theta$ ...
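The margin function $h_i(\theta) = \theta_i - \max_{j \ne i} \theta_j$ that appears in the statement above is simple to compute. The sketch below is only an illustration of that definition; the function name `margin` and the example vector are assumptions, and the code does not reproduce the rest of the theorem. Note that $h_i(\theta) > 0$ exactly when coordinate $i$ is the unique maximizer of $\theta$.

```python
def margin(theta, i):
    """h_i(theta) = theta_i - max_{j != i} theta_j (requires len(theta) >= 2)."""
    others = [t for j, t in enumerate(theta) if j != i]
    return theta[i] - max(others)

# Illustrative parameter vector (an assumption, not from the paper).
theta = [0.5, -1.0, 2.0]
margins = [margin(theta, i) for i in range(len(theta))]
print(margins)  # → [-1.5, -3.0, 1.5]
```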

Figures (10)

Figure 1: Illustration of the representation properties of the Perturb-Softmax and of the Perturb-Argmax.
Figure 2: An illustration of $\partial f(\theta)$ for perturbations with a smooth bounded probability density function $\gamma \sim U(-1,1)$. $\partial f(\theta)$ is a single-valued mapping between the parameters and the Perturb-Argmax probability.
Figure 3: An illustration of the sub-differential of $f(\theta)$ (Equation \ref{discrete_max_ranges}) w.r.t. $\theta_1$ for discrete random variables $\gamma_i \in \{1, -1\}$ that are uniformly distributed. Notably, the Perturb-Argmax probability is a multi-valued mapping in its overlapping segments, e.g., for $\theta_1 = \theta_2$,
Figure 4: Gumbel-Softmax and Normal-Softmax approximation of target discrete distributions $p_0$ with finite support. The $L_1$ objective over learning iterations is depicted on the right.
Figure 5: Categorical VAE with Perturb-Softmax training loss on the MNIST dataset (top row) and the Omniglot dataset (bottom row) with a $K$-dimensional categorical variable, $K \in \{10, 30, 50\}$.
...and 5 more figures

Theorems & Definitions (23)

Theorem 4.1: Completeness of Perturb-Softmax
proof
Lemma 4.2: Strict convexity
Theorem 4.3: Minimality of Perturb-Softmax
proof
Theorem 5.1: Completeness of Perturb-Argmax
proof
Lemma 5.2: Differentiability of Perturb-Max
Theorem 5.3: Minimality of Perturb-Argmax
Proposition 5.4
...and 13 more
