On Parameter Estimation in Deviated Gaussian Mixture of Experts

Huy Nguyen; Khai Nguyen; Nhat Ho

TL;DR

This work analyzes maximum likelihood estimation in a deviated Gaussian mixture of experts with unknown mixing proportion $\lambda^*$. By introducing Voronoi-based loss functions, the authors derive sharp convergence rates for density and parameter estimation under both distinguishable and non-distinguishable interactions between the known component $g_0$ and the mixture, revealing rates that depend on how many fitted components capture true atoms. They show the proposed losses yield parametric density convergence at $\widetilde{\mathcal{O}}(n^{-1/2})$ and provide component-wise rates influenced by polynomial-equation solvability, improving over generalized Wasserstein bounds. Simulations corroborate the theory, illustrating faster parameter recovery for atoms fitted by single components and slower rates when multiple components cover a single atom, with distinct behavior for $a$, $b$, and $\sigma$. These results advance understanding of parameter estimation in MoEs under model deviations and interactions, with practical implications for goodness-of-fit testing and mixture-model inference.

Abstract

We consider the parameter estimation problem in the deviated Gaussian mixture of experts, in which the data are generated from $(1 - \lambda^{\ast}) g_0(Y|X) + \lambda^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_{i}^{\ast},\sigma_{i}^{\ast})$, where $X$ and $Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $\lambda^{\ast} \in [0, 1]$ is the true but unknown mixing proportion, and $(p_{i}^{\ast}, a_{i}^{\ast}, b_{i}^{\ast}, \sigma_{i}^{\ast})$ for $1 \leq i \leq k_{\ast}$ are the unknown parameters of the Gaussian mixture of experts. This problem arises from goodness-of-fit testing, where we would like to test whether the data are generated from $g_{0}(Y|X)$ (null hypothesis) or from the whole mixture (alternative hypothesis). Based on the algebraic structure of the expert functions and the distinguishability between $g_0$ and the mixture part, we construct novel Voronoi-based loss functions to capture the convergence rates of maximum likelihood estimation (MLE) for our models. We further demonstrate that our proposed loss functions characterize the local convergence rates of parameter estimation more accurately than the generalized Wasserstein loss, which is commonly used for estimating parameters in the Gaussian mixture of experts.
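
To make the model concrete, the conditional density in the abstract can be evaluated numerically. The following is a minimal sketch, not the authors' code. It assumes that $f(\cdot|\mu,\sigma)$ is a univariate Gaussian density with mean $\mu$ and variance $\sigma$, and it uses a standard normal density for $g_0$ purely for illustration. All function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def g0(y, x):
    """Illustrative known density g0(Y|X): standard normal, ignoring X (assumption)."""
    return gaussian_pdf(y, 0.0, 1.0)

def deviated_moe_density(y, x, lam, p, a, b, sigma):
    """(1 - lam) * g0(y|x) + lam * sum_i p[i] * f(y | a[i]^T x + b[i], sigma[i]).

    Shapes: y (n,), x (n, d), p, b, sigma (k,), a (k, d).
    sigma[i] is treated as a variance (assumption).
    """
    means = x @ a.T + b                               # (n, k) expert means
    experts = gaussian_pdf(y[:, None], means, sigma)  # (n, k) expert densities
    return (1.0 - lam) * g0(y, x) + lam * (experts @ p)

def neg_log_likelihood(y, x, lam, p, a, b, sigma):
    """Average negative log-likelihood; the MLE minimizes it over all parameters."""
    return -np.mean(np.log(deviated_moe_density(y, x, lam, p, a, b, sigma)))
```

Setting `lam = 0` recovers $g_0$ exactly, which corresponds to the null hypothesis of the goodness-of-fit test described above.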

Paper Structure (22 sections, 10 theorems, 163 equations, 2 figures, 1 table)

INTRODUCTION
BACKGROUND
CONVERGENCE RATES OF PARAMETER ESTIMATION
Distinguishable Settings
Non-distinguishable Settings
Proof Sketch
SIMULATION STUDY
CONCLUSION
PROOF OF MAIN RESULTS
Proof of Theorem \ref{theorem:distinguishable_dependent}
Proof of Theorem \ref{theorem:partial_dependent}
PROOF OF AUXILIARY RESULTS
Proof of Proposition \ref{prop:identifiable}
Proof of Proposition \ref{prop:mle_estimation}
Main Proof
...and 7 more sections

Key Result

Proposition 1

Let $G,G'$ be two mixing measures in $\mathcal{O}_{k}(\Theta)$ and $\lambda,\lambda'$ be two mixing proportions in $[0,1]$. Assume that $p_{G_*}$ is distinguishable from $g_0$. Then, if the identifiability equation $p_{\lambda, G}(X,Y)=p_{\lambda', G'}(X,Y)$ holds for almost every $(X,Y)\in\mathcal{

Figures (2)

Figure 1: Illustration of the Voronoi cells generated by the components of $G_{\ast}$ (red crosses) and the fitted components of the MLE $\widehat{G}_n$ (blue points). Under the distinguishable settings, Theorem \ref{theorem:distinguishable_dependent} indicates that the rates for estimating the true components $(a^*_j,b^*_j,\sigma^*_j)$ in cells $1,2,4,6$, each of which is fitted by one component, are of order $\widetilde{\mathcal{O}}(n^{-1/2})$. Meanwhile, the rates for the true component $(a^*_3,b^*_3,\sigma^*_3)$ in cell 3, which is fitted by three components, are substantially slower: $\widetilde{\mathcal{O}}(n^{-1/4})$, $\widetilde{\mathcal{O}}(n^{-1/12})$ and $\widetilde{\mathcal{O}}(n^{-1/6})$, respectively.
Figure 2: Convergence rates of the maximum likelihood estimator $(\hat{\lambda}_n,\widehat{G}_n)$ under the Voronoi loss functions.
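
The Voronoi cells in Figure 1 group the fitted components by their nearest true component. A minimal sketch of that assignment follows, assuming each component is flattened into a vector $(a, b, \sigma)$ and that closeness is measured in the Euclidean norm. The function name and the toy values are hypothetical and do not come from the paper.

```python
import numpy as np

def voronoi_cells(fitted, true):
    """Assign each fitted component to the nearest true component.

    fitted: (k, m) array of fitted parameter vectors (a, b, sigma), one per row.
    true:   (k_star, m) array of true parameter vectors.
    Returns a list of index arrays, one per true component.
    """
    # Pairwise Euclidean distances, shape (k, k_star).
    dists = np.linalg.norm(fitted[:, None, :] - true[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)  # ties go to the lowest index
    return [np.flatnonzero(nearest == j) for j in range(true.shape[0])]

# Toy example: two true atoms, three fitted components.
true = np.array([[0.0, 0.0, 1.0], [3.0, 1.0, 1.0]])
fitted = np.array([[0.1, 0.0, 1.0], [2.9, 1.1, 0.9], [3.2, 0.8, 1.1]])
cells = voronoi_cells(fitted, true)
sizes = [len(c) for c in cells]  # cell cardinalities
```

In the terms of the Figure 1 caption, a true atom whose cell holds a single fitted component falls in the fast-rate regime, while a cell holding several fitted components falls in the slower regime.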

Theorems & Definitions (14)

Definition 1: Distinguishability Condition
Example 1
Proposition 1: Identifiability
Proposition 2: Density estimation rate
Theorem 1
Theorem 2
Lemma 1
Lemma 2
Lemma 3: Theorem 5.11 in van de Geer (2000)
Theorem 3
...and 4 more
