On Parameter Estimation in Deviated Gaussian Mixture of Experts
Huy Nguyen, Khai Nguyen, Nhat Ho
TL;DR
This work analyzes maximum likelihood estimation in a deviated Gaussian mixture of experts with unknown mixing proportion $\lambda^*$. By introducing Voronoi-based loss functions, the authors derive sharp convergence rates for density and parameter estimation under both distinguishable and non-distinguishable interactions between the known component $g_0$ and the mixture, revealing rates that depend on how many fitted components capture true atoms. They show the proposed losses yield parametric density convergence at $\widetilde{\mathcal{O}}(n^{-1/2})$ and provide component-wise rates influenced by polynomial-equation solvability, improving over generalized Wasserstein bounds. Simulations corroborate the theory, illustrating faster parameter recovery for atoms fitted by single components and slower rates when multiple components cover a single atom, with distinct behavior for $a$, $b$, and $\sigma$. These results advance understanding of parameter estimation in MoEs under model deviations and interactions, with practical implications for goodness-of-fit testing and mixture-model inference.
Abstract
We consider the parameter estimation problem in the deviated Gaussian mixture of experts, in which the data are generated from $(1 - \lambda^{\ast}) g_0(Y|X) + \lambda^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},\sigma_{i}^{\ast})$, where $X$ and $Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $\lambda^{\ast} \in [0, 1]$ is the true but unknown mixing proportion, and $(p_{i}^{\ast}, a_{i}^{\ast}, b_{i}^{\ast}, \sigma_{i}^{\ast})$ for $1 \leq i \leq k_{\ast}$ are the unknown parameters of the Gaussian mixture of experts. This problem arises from goodness-of-fit testing, where we would like to test whether the data are generated from $g_{0}(Y|X)$ (null hypothesis) or from the whole mixture (alternative hypothesis). Based on the algebraic structure of the expert functions and the distinguishability between $g_0$ and the mixture part, we construct novel Voronoi-based loss functions to capture the convergence rates of maximum likelihood estimation (MLE) for our models. We further demonstrate that our proposed loss functions characterize the local convergence rates of parameter estimation more accurately than the generalized Wasserstein loss, which is commonly used for estimating parameters in the Gaussian mixture of experts.
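To make the data-generating model concrete, the following is a minimal sketch of the deviated density and of the average log-likelihood that an MLE would maximize. It is an illustration under stated assumptions, not the authors' implementation: the known function $g_0$ is taken to be a standard normal density that ignores $X$, the covariates are drawn from a standard normal distribution, $\sigma_i$ is treated as a variance, and all function names and parameter values are hypothetical.

```python
import numpy as np


def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density; `var` is a variance (assumption)."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def g0_pdf(y, x):
    """Hypothetical known component g0(Y|X): standard normal, ignores x."""
    return gaussian_pdf(y, 0.0, 1.0)


def deviated_density(y, x, lam, p, a, b, var):
    """(1 - lam) * g0(y|x) + lam * sum_i p_i * f(y | a_i^T x + b_i, var_i).

    Shapes: y (n,), x (n, d), p (k,), a (k, d), b (k,), var (k,).
    """
    means = x @ a.T + b                                   # (n, k) expert means
    comps = gaussian_pdf(y[:, None], means, var[None, :])  # (n, k) expert densities
    return (1.0 - lam) * g0_pdf(y, x) + lam * (comps @ p)


def sample(n, lam, p, a, b, var, rng):
    """Draw (X, Y) pairs from the deviated model with X ~ N(0, I) (assumption)."""
    x = rng.normal(size=(n, a.shape[1]))
    from_mixture = rng.random(n) < lam            # mixture part with probability lam
    idx = rng.choice(len(p), size=n, p=p)         # expert index for each draw
    means = np.einsum("nd,nd->n", x, a[idx]) + b[idx]
    y_mixture = means + np.sqrt(var[idx]) * rng.normal(size=n)
    y_null = rng.normal(size=n)                   # draw from the assumed g0
    return x, np.where(from_mixture, y_mixture, y_null)


def avg_log_likelihood(y, x, lam, p, a, b, var):
    """Objective that a maximum likelihood estimator would maximize."""
    return float(np.mean(np.log(deviated_density(y, x, lam, p, a, b, var))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical true parameters with two experts and one covariate.
    p_true = np.array([0.4, 0.6])
    a_true = np.array([[1.0], [-1.0]])
    b_true = np.array([0.5, -0.5])
    var_true = np.array([0.5, 2.0])
    x, y = sample(1000, 0.7, p_true, a_true, b_true, var_true, rng)
    print(avg_log_likelihood(y, x, 0.7, p_true, a_true, b_true, var_true))
```

Setting `lam = 0` reduces `deviated_density` to `g0_pdf`, which corresponds to the null hypothesis of the goodness-of-fit test; any `lam > 0` corresponds to the alternative.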
