On The Statistical Representation Properties Of The Perturb-Softmax And The Perturb-Argmax Probability Distributions
Hedda Cohen Indelman, Tamir Hazan
TL;DR
This work analyzes the statistical representation properties of the Perturb-Softmax and Perturb-Argmax probability distributions, establishing when these perturbation-based models are complete (able to represent any probability distribution) and minimal (representing each probability distribution uniquely). By framing softmax as the gradient of the log-sum-exp function and argmax as a sub-gradient of the max function, both of which are convex, the authors use convex analysis and duality to derive conditions on the parameter space under which the representation is complete and minimal, and extend the analysis to Gaussian perturbations (Gaussian-Softmax and Gaussian-Argmax). The theoretical results are complemented by experiments showing that Gaussian-Softmax can converge faster and approximate discrete distributions better than Gumbel-Softmax, and that the type of perturbation critically influences identifiability and minimality. The framework provides a rigorous foundation for selecting perturbations in discrete modeling and offers practical benefits for both generative and discriminative learning tasks.
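The convex-analytic framing can be checked numerically. The sketch below is illustrative only and is not code from the paper: the function names (`log_sum_exp`, `numerical_gradient`, `argmax_one_hot`) and the parameter vector `theta` are assumptions made for this example. It compares a finite-difference gradient of log-sum-exp with softmax, and a finite-difference gradient of max with the one-hot argmax at a point where the maximizer is unique (so the sub-gradient reduces to an ordinary gradient).

```python
import math


def log_sum_exp(theta):
    """Numerically stable log(sum_i exp(theta_i))."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))


def softmax(theta):
    """Softmax of a list of scores, computed stably."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def argmax_one_hot(theta):
    """One-hot indicator of the (first) maximizing coordinate."""
    best = max(range(len(theta)), key=lambda i: theta[i])
    return [1.0 if i == best else 0.0 for i in range(len(theta))]


def numerical_gradient(f, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


# Hypothetical parameter vector with a unique maximizer (index 2).
theta = [0.5, -1.0, 2.0]

grad_lse = numerical_gradient(log_sum_exp, theta)  # should match softmax(theta)
grad_max = numerical_gradient(max, theta)          # should match the one-hot argmax

print("grad log-sum-exp:", grad_lse)
print("softmax:         ", softmax(theta))
print("grad max:        ", grad_max)
print("argmax one-hot:  ", argmax_one_hot(theta))
```

At a tie between coordinates the max function is not differentiable, and the finite-difference estimate would no longer be a one-hot vector; that is the case in which argmax is only a sub-gradient.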
Abstract
The Gumbel-Softmax probability distribution allows learning discrete tokens in generative learning, while the Gumbel-Argmax probability distribution is useful in learning discrete structures in discriminative learning. Despite the efforts invested in optimizing these probability models, their statistical properties are under-explored. In this work, we investigate their representation properties and determine for which families of parameters these probability distributions are complete, i.e., can represent any probability distribution, and minimal, i.e., can represent a probability distribution uniquely. We rely on convexity and differentiability to determine these statistical conditions and extend this framework to general probability models, such as Gaussian-Softmax and Gaussian-Argmax. We experimentally validate the qualities of these extensions, which enjoy a faster convergence rate. We conclude the analysis by identifying two sets of parameters that satisfy these assumptions and thus admit a complete and minimal representation. Our contribution is theoretical with supporting practical evaluation.
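As a rough illustration of the models named in the abstract, the following sketch draws Perturb-Argmax samples and a Perturb-Softmax relaxation using only the Python standard library. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names, the unit-scale noise, the `temperature` parameter, and the example `theta` are all hypothetical. It relies on two standard facts: with standard Gumbel noise, the argmax of the perturbed parameters is distributed according to softmax(theta) (the Gumbel-max trick), and softmax is unchanged when the same constant is added to every parameter, which is one simple reason an unconstrained parameter space cannot represent a distribution uniquely.

```python
import math
import random


def softmax(scores):
    """Softmax of a list of scores, computed stably."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sample_gumbel(rng):
    """Standard Gumbel sample via inverse transform: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def sample_gaussian(rng):
    """Standard normal sample."""
    return rng.gauss(0.0, 1.0)


def perturb_argmax(theta, sample_noise, rng):
    """Index maximizing theta_i + gamma_i for i.i.d. noise gamma_i."""
    perturbed = [t + sample_noise(rng) for t in theta]
    return max(range(len(theta)), key=lambda i: perturbed[i])


def perturb_softmax(theta, sample_noise, rng, temperature=1.0):
    """Softmax of the perturbed parameters (a relaxed, continuous sample)."""
    perturbed = [(t + sample_noise(rng)) / temperature for t in theta]
    return softmax(perturbed)


def empirical_argmax_distribution(theta, sample_noise, num_samples, seed=0):
    """Monte Carlo estimate of the Perturb-Argmax distribution."""
    rng = random.Random(seed)
    counts = [0] * len(theta)
    for _ in range(num_samples):
        counts[perturb_argmax(theta, sample_noise, rng)] += 1
    return [c / num_samples for c in counts]


# Hypothetical parameters, chosen only for illustration.
theta = [0.5, -1.0, 2.0]
gumbel_freq = empirical_argmax_distribution(theta, sample_gumbel, 20000)
gaussian_freq = empirical_argmax_distribution(theta, sample_gaussian, 20000)

# Adding a constant to every parameter leaves softmax unchanged.
shifted = [t + 3.0 for t in theta]

# One relaxed sample from a Gaussian-perturbed softmax.
relaxed = perturb_softmax(theta, sample_gaussian, random.Random(1), temperature=0.5)

print("softmax(theta):      ", softmax(theta))
print("Gumbel-Argmax freq:  ", gumbel_freq)
print("Gaussian-Argmax freq:", gaussian_freq)
print("softmax(theta + 3):  ", softmax(shifted))
print("Gaussian-Softmax sample:", relaxed)
```

The Gumbel-Argmax frequencies should approach softmax(theta) as the number of samples grows. The Gaussian-Argmax frequencies generally differ from softmax(theta), which is why the choice of perturbation matters for which distributions a given parameter family can represent.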
