Extremal graphical modeling with latent variables via convex optimization
Sebastian Engelke, Armeen Taeb
TL;DR
This work addresses learning extremal graphical models in the presence of latent variables by introducing eglatent, a convex program that decomposes the marginal Hüsler-Reiss (HR) precision matrix \tilde{\Theta} of the observed variables into a sparse part encoding the graph among the observed variables and a low-rank part encoding the effect of the latent variables. The decomposition rests on a Schur-complement-like relationship \tilde{\Theta} = \Theta_O - \Theta_{OH}\Theta_H^{-1}\Theta_{HO}, where O and H index the observed and latent (hidden) variables, and the program promotes this sparse-plus-low-rank structure to identify both the conditional graph among the observed variables and the number of latent variables. The authors provide identifiability conditions and finite-sample consistency guarantees, derive an empirical variogram estimator, and demonstrate improved structure recovery and estimation of the number of latent variables on synthetic data and a real flight-delay dataset. The resulting models are more interpretable and represent tail dependence more accurately, which supports risk assessment for extreme events in high dimensions. The work also offers practical guidance and open-source tooling for practitioners analyzing extremes with latent confounders.
Abstract
Extremal graphical models encode the conditional independence structure of multivariate extremes and provide a powerful tool for quantifying the risk of rare events. Prior work on learning these graphs from data has focused on the setting where all relevant variables are observed. For the popular class of Hüsler-Reiss models, we propose the \texttt{eglatent} method, a tractable convex program for learning extremal graphical models in the presence of latent variables. Our approach decomposes the Hüsler-Reiss precision matrix into a sparse component encoding the graphical structure among the observed variables after conditioning on the latent variables, and a low-rank component encoding the effect of a few latent variables on the observed variables. We provide finite-sample guarantees for \texttt{eglatent} and show that it consistently recovers the conditional graph as well as the number of latent variables. We highlight the improved performance of our approach on synthetic and real data.
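The sparse-plus-low-rank decomposition can be illustrated numerically. The following is a minimal NumPy sketch, not the eglatent estimator itself. It assumes a toy weighted graph Laplacian as a stand-in for a Hüsler-Reiss precision matrix (symmetric with zero row sums), with four observed variables in a chain and one latent variable connected to all of them. The helper `laplacian`, the variable names, and the edge weights are illustrative choices, not taken from the paper. The sketch computes the marginal precision of the observed variables as a Schur complement and shows that it is dense even though the conditional graph among the observed variables is sparse.

```python
import numpy as np

def laplacian(weights):
    """Weighted graph Laplacian: symmetric with zero row sums."""
    W = np.asarray(weights, dtype=float)
    return np.diag(W.sum(axis=1)) - W

# Toy setup (an assumption for illustration): p observed nodes, h latent nodes.
p, h = 4, 1
W = np.zeros((p + h, p + h))
for i in range(p - 1):              # sparse chain among the observed variables
    W[i, i + 1] = W[i + 1, i] = 1.0
W[:p, p:] = 0.5                     # the latent node touches every observed node
W[p:, :p] = 0.5

Theta = laplacian(W)                # joint precision over observed + latent
Theta_O = Theta[:p, :p]             # sparse block: graph given the latent variable
Theta_OH = Theta[:p, p:]            # observed-latent coupling
Theta_H = Theta[p:, p:]             # latent block (invertible here)

# Low-rank term and marginal precision of the observed variables
# (Schur complement of the latent block).
low_rank = Theta_OH @ np.linalg.solve(Theta_H, Theta_OH.T)
Theta_marginal = Theta_O - low_rank

print("rank of low-rank term:", np.linalg.matrix_rank(low_rank))
print("zeros in Theta_O:", int(np.sum(np.isclose(Theta_O, 0.0))))
print("zeros in marginal precision:", int(np.sum(np.isclose(Theta_marginal, 0.0))))
```

In this toy example the low-rank term has rank equal to the number of latent variables, and marginalizing out the latent variable fills in the zeros of the sparse block. This is why fitting a sparse graph directly to the observed variables can be misleading when latent variables are present.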
