Fenchel-Young Estimators of Perturbed Utility Models

Xi Lin; Yafeng Yin; Tianming Liu

Abstract

The Perturbed Utility Model framework offers a powerful generalization of discrete choice analysis, unifying models like Multinomial Logit and Sparsemax through convex optimization. However, standard Maximum Likelihood Estimation (MLE) faces severe theoretical and numerical challenges when applied to this broader class, particularly regarding non-convexity and instability in sparse regimes. To resolve these issues, this paper introduces a unified estimation framework based on the Fenchel-Young loss. By leveraging the intrinsic convex conjugate structure of PUMs, we demonstrate that the Fenchel-Young estimator guarantees global convexity and bounded gradients, providing a mathematically natural alternative to MLE. Addressing the critical challenge of data scarcity, we further extend this framework via Wasserstein Distributionally Robust Optimization. We first derive an exact finite-dimensional reformulation of the infinite-dimensional primal problem, establishing its theoretical convexity. However, recognizing that the resulting worst-case constraints involve computationally intractable inner maximizations, we subsequently construct a tractable safe approximation by exploiting the global Lipschitz continuity of the Fenchel-Young loss. Through this tractable formulation, we uncover a rigorous geometric unification: two canonical regularization techniques, standard L2-regularization and the margin-enforcing Hinge loss, emerge mathematically as specific limiting cases of our distributionally robust estimator. Extensive experiments on synthetic data and the Swissmetro benchmark validate that the proposed framework significantly outperforms traditional methods, recovering stable preferences even under severe data limitations.
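The instability of MLE in sparse regimes, and the bounded gradient of the Fenchel-Young loss, can be illustrated with a minimal sketch. It assumes the sparsemax instance uses the quadratic regularizer $\Lambda(\mathbf{p}) = \tfrac{1}{2}\|\mathbf{p}\|_2^2$ on the simplex, which the abstract does not state explicitly. The function names and utility values are hypothetical.

```python
import math

def sparsemax(v):
    """Euclidean projection of v onto the probability simplex."""
    z = sorted(v, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The j-th largest entry stays in the support while this holds.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(x - tau, 0.0) for x in v]

def fy_loss_sparsemax(v, y):
    """Omega(v) + Lambda(y) - <v, y>, with Lambda = 0.5 * ||.||^2 (assumed)."""
    p = sparsemax(v)
    omega = sum(vi * pi for vi, pi in zip(v, p)) - 0.5 * sum(pi * pi for pi in p)
    lam_y = 0.5 * sum(yi * yi for yi in y)
    return omega + lam_y - sum(vi * yi for vi, yi in zip(v, y))

def neg_log_likelihood(p, y):
    """Infinite as soon as an observed alternative has probability zero."""
    total = 0.0
    for pi, yi in zip(p, y):
        if yi > 0:
            total += math.inf if pi == 0.0 else -yi * math.log(pi)
    return total

v = [2.0, 0.0, -1.0]  # hypothetical utilities
y = [0.0, 1.0, 0.0]   # observed choice: the second alternative
p = sparsemax(v)
fy = fy_loss_sparsemax(v, y)
nll = neg_log_likelihood(p, y)
fy_gradient = [pi - yi for pi, yi in zip(p, y)]  # gradient with respect to v
```

In this example sparsemax assigns zero probability to the chosen alternative, so the negative log-likelihood is infinite. The Fenchel-Young loss stays finite, and its gradient `p - y` is a difference of two probability vectors, hence bounded.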

Paper Structure (47 sections, 14 theorems, 140 equations, 11 figures, 3 tables)

Introduction
Perturbed Utility Models
Basics and Examples
The Primal Formulation: Utility Maximization with Regularization
Convex Conjugacy and the Generalized Williams-Daly-Zachary Theorem
Additive Separability and the Choice Kernel
Specific Instances
Multinomial Logit (MNL)
The Sparsemax Model (Quadratic Regularization)
The Cauchy Model
Fenchel-Young Estimation of Perturbed Utility Models
The Pathologies of Maximum Likelihood Estimation
The Zero-Probability Singularity
Non-Convexity
The Loss Function
...and 32 more sections

Key Result

Proposition 1

Under assumptions (A1)-(A2), the surplus function $\Omega(\mathbf{V}_n)$ is convex and differentiable everywhere on $\mathbb{R}^{|\mathcal{C}|}$. Furthermore, the optimal choice probability vector solving the primal PUM problem is given by the gradient of the surplus function: $\mathbf{p}_n = \nabla \Omega(\mathbf{V}_n)$.
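Proposition 1 ties the choice probabilities to the surplus function. A minimal sketch for the MNL instance follows. It assumes the MNL surplus is the log-sum-exp function, whose gradient is the softmax. It compares the softmax against a central finite-difference gradient. The helper names and the utility vector are hypothetical.

```python
import math

def logsumexp(v):
    """Numerically stable log-sum-exp, the assumed MNL surplus Omega(v)."""
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

def softmax(v):
    """MNL choice probabilities."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def numerical_gradient(f, v, h=1e-6):
    """Central finite-difference approximation of the gradient of f at v."""
    grad = []
    for i in range(len(v)):
        up = list(v)
        dn = list(v)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad

v = [1.0, 0.5, -0.3]  # hypothetical utilities
p = softmax(v)
g = numerical_gradient(logsumexp, v)
max_gap = max(abs(pi - gi) for pi, gi in zip(p, g))
```

The value `max_gap` should be close to zero, which is consistent with reading the choice probabilities as the slope of the surplus function in utility space.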

Figures (11)

Figure 1: Geometric Interpretation of Fenchel-Young Estimation: Duality between Utility and Probability Spaces. The Left Panel displays the surplus function $\Omega(\mathbf{V})$ in the utility space. The model prediction $\mathbf{p}$ corresponds to the gradient (slope) of the tangent at the current utility $\mathbf{V}_{\text{curr}}$. The estimation goal is to align this slope with the target slope defined by the observed label $\mathbf{y}$. The Right Panel displays the conjugate regularization function $\Lambda(\mathbf{p})$ in the probability space. Here, the Fenchel-Young loss is visualized as the Bregman divergence $D_\Lambda(\mathbf{y} \| \mathbf{p})$, i.e., the vertical gap between the function value $\Lambda(\mathbf{y})$ and the tangent hyperplane constructed at the prediction $\mathbf{p}$. Crucially, the two views are mathematically equivalent via the Legendre-Fenchel transform: the slope in the left panel ($\mathbf{p}$) becomes the coordinate in the right panel, and the slope in the right panel ($\mathbf{V}$) corresponds to the coordinate in the left.
Figure 2: Convergence performance of the Projected Extragradient algorithm for the standard Fenchel-Young estimator under a non-separable quadratic perturbation. Panel (a) illustrates the rapid stabilization of the parameter error in Euclidean distance. Panel (b) demonstrates the monotonic decrease of the KKT residual on a logarithmic scale. The algorithm achieves linear convergence in this scenario.
Figure 3: Convergence performance of the solution algorithm for the Wasserstein DRO estimator. Panel (a) illustrates the smooth stabilization of the parameter error. Panel (b) demonstrates the oscillating decrease of the KKT residual on a logarithmic scale over 5,000 iterations.
Figure 4: Validation of the regularization scaling law. The scatter plot compares the optimal regularization parameter $\lambda$ obtained via exhaustive line search against the predicted value from the empirical formula. The strong alignment along the $y=x$ line demonstrates that the formula $\lambda_{\text{reg}} = 0.13 \frac{\sqrt{d/N}}{\|\boldsymbol{\beta}^*\|_{2}}$ yields results that are highly consistent with the optimal parameters.
Figure 5: Comparison of MSE between the estimators. Notably, when the sample size is small ($N \le 200$), the $\ell_{2}$ regularized Fenchel-Young estimator achieves a significantly lower average MSE compared to the original FY estimator, demonstrating better robustness in data-scarce regimes.
...and 6 more figures
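The Bregman-divergence reading of the Fenchel-Young loss in the Figure 1 caption can be checked numerically. The sketch below assumes the MNL instance, with $\Omega$ taken as log-sum-exp and $\Lambda$ as the negative Shannon entropy. It evaluates the loss in utility space and the Bregman gap $D_\Lambda(\mathbf{y} \| \mathbf{p})$ in probability space. The function names and the numbers are hypothetical.

```python
import math

def logsumexp(v):
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def neg_entropy(p):
    """Lambda(p) = sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi) for pi in p if pi > 0.0)

def fy_loss(v, y):
    """Utility-space view: Omega(v) + Lambda(y) - <v, y>."""
    return logsumexp(v) + neg_entropy(y) - sum(vi * yi for vi, yi in zip(v, y))

def bregman_neg_entropy(y, p):
    """Probability-space view: Lambda(y) - Lambda(p) - <grad Lambda(p), y - p>."""
    # grad Lambda(p) = log p + 1; softmax output is strictly positive.
    grad = [math.log(pi) + 1.0 for pi in p]
    inner = sum(gi * (yi - pi) for gi, yi, pi in zip(grad, y, p))
    return neg_entropy(y) - neg_entropy(p) - inner

v = [1.0, 0.5, -0.3]  # hypothetical current utilities
y = [0.7, 0.2, 0.1]   # hypothetical target shares
p = softmax(v)
loss = fy_loss(v, y)
gap = bregman_neg_entropy(y, p)
```

Under these assumptions `loss` and `gap` agree up to floating-point error, and both equal the Kullback-Leibler divergence between `y` and `p`.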

Theorems & Definitions (27)

Proposition 1: PUM Duality
Definition 1: Additive Separable PUM
Definition 2: Choice Kernel
Example 1: Non-convexity of MLE for Cauchy PUMs
Proposition 2: Proper Scoring
proof
Proposition 3: FY vs. MLE for Multinomial Logit
Proposition 4: Global Convexity of PUM Estimation
proof
Lemma 1: Continuity and domination of the FY loss
...and 17 more
