Statistical methodology and applications in science
Sparse identification of nonlinear dynamics (SINDy) has been widely used to discover the governing equations of a dynamical system from data. It uses sparse regression techniques to identify parsimonious models of unknown systems from a library of candidate functions. Consequently, it relies on the assumption that the dynamics admit a sparse representation in the coordinate system used. When this assumption fails, one seeks a coordinate transformation that provides reduced coordinates capable of reconstructing the original system. Recently, SINDy autoencoders have extended this idea by combining sparse model discovery with autoencoder architectures to learn simplified latent coordinates together with parsimonious governing equations. A central challenge in this framework is robustness to measurement noise. Inspired by noise-separating neural network structures, we incorporate a noise-separation module into the SINDy autoencoder architecture, thereby improving robustness and enabling more reliable identification of noisy dynamical systems. Numerical experiments on the Lorenz system show that the proposed method recovers interpretable latent dynamics and accurately estimates the measurement noise from noisy observations.
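As a hedged illustration of the sparse-regression core on which SINDy builds (not the authors' autoencoder or noise-separation module), the following Python sketch applies sequentially thresholded least squares to a polynomial candidate library; the library, threshold, and simulated system are illustrative assumptions.

import numpy as np

def candidate_library(X):
    # Polynomial candidate functions up to degree 2 (an illustrative choice).
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])

def stlsq(Theta, dXdt, threshold=0.05, n_iter=10):
    # Sequentially thresholded least squares: the sparse-regression step of SINDy.
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):  # refit each state dimension on its active terms
            active = ~small[:, k]
            if active.any():
                Xi[active, k] = np.linalg.lstsq(Theta[:, active], dXdt[:, k], rcond=None)[0]
    return Xi

# Toy data from a damped linear oscillator dx/dt = -0.1 x + 2 y, dy/dt = -2 x - 0.1 y, plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 2000)
X = np.column_stack([np.exp(-0.1 * t) * np.cos(2 * t), -np.exp(-0.1 * t) * np.sin(2 * t)])
dXdt = np.gradient(X, t, axis=0) + 0.01 * rng.standard_normal(X.shape)
Xi = stlsq(candidate_library(X), dXdt)
print(np.round(Xi, 2))  # nonzero entries indicate the identified governing terms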
K-means clustering, a classic and widely used clustering technique, is known to exhibit suboptimal performance when applied to non-linearly separable data. Numerous adjustments and modifications have been proposed to address this issue, including methods that merge K-means results obtained with a relatively large K into a final cluster assignment. However, existing methods of this nature are often computationally inefficient and sensitive to hyperparameter tuning. Here we present \emph{CavMerge}, a novel K-means merging algorithm that is intuitive, free of parameter tuning, and computationally efficient. Under minimal local distributional assumptions, the algorithm enjoys strong consistency and rapid convergence guarantees. Empirical studies on various simulated and real datasets demonstrate that our method yields more reliable clusters than current state-of-the-art algorithms.
Sequential estimators are proposed for the relative risk, odds ratio, log relative risk or log odds ratio of a dichotomous attribute in two populations. The estimators take the same number of observations from each population, and guarantee that the relative mean-square error for the relative risk or odds ratio, or the mean-square error for their logarithmic versions, is less than a given target. The efficiency of the estimators, defined in terms of the Cramér-Rao bound, is high when the considered attribute is rare or moderately rare.
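For concreteness, the target parameters and accuracy criteria described above can be written as follows; the notation ($p_1, p_2$ for the attribute probabilities in the two populations, $\varepsilon$ for the prescribed target) is illustrative rather than the paper's own.
\[
  \mathrm{RR} = \frac{p_1}{p_2}, \qquad
  \mathrm{OR} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)},
\]
\[
  \frac{\mathbb{E}\bigl[(\widehat{\mathrm{RR}}-\mathrm{RR})^2\bigr]}{\mathrm{RR}^2} \le \varepsilon
  \quad \text{(relative mean-square error target)}, \qquad
  \mathbb{E}\bigl[(\widehat{\log\mathrm{RR}}-\log\mathrm{RR})^2\bigr] \le \varepsilon
  \quad \text{(mean-square error target for the log version)}.
\]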
Small area estimation (SAE) produces estimates of population parameters for geographic and demographic subgroups with limited sample sizes. Such estimates are critical for informing policy decisions, ranging from poverty mapping to social program funding. Despite its widespread use, principled validation of SAE models remains challenging and general guidelines are far from well-established. Unlike conventional predictive modeling settings, validation data are rarely available in the SAE context. External validation surveys or censuses often do not exist, and access to individual-level microdata is often restricted, making standard cross-validation infeasible. In this paper, we propose a novel model validation scheme using only area-level direct survey estimates under the widely used Fay--Herriot model. Our approach is based on data thinning, which splits area-level observations into independent training and test components to enable out-of-sample validation. Our theoretical analysis reveals a fundamental tension inherent in thinning-based validation: performance metrics measured on the thinned training component target a different quantity than those based on the full data, with the gap varying by model complexity. Increasing the information allocated for training reduces this gap but inflates the variance of the estimator. We formally characterize this bias-variance tradeoff and provide practical recommendations for the thinning parameters that balance these competing considerations for model comparison. We show that data thinning with these settings provides consistent and stable performance across heterogeneous sampling designs in design-based simulations using American Community Survey microdata.
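A minimal sketch of the Gaussian data-thinning step described above, applied to area-level direct estimates with known sampling variances; the thinning fraction, toy data, and names are illustrative assumptions, not the paper's recommended settings.

import numpy as np

def thin_gaussian(y, var, eps, rng):
    # Split y ~ N(theta, var) into independent pieces
    # y_train ~ N(eps * theta, eps * var) and y_test ~ N((1 - eps) * theta, (1 - eps) * var).
    z = rng.normal(0.0, np.sqrt(eps * (1 - eps) * var))
    y_train = eps * y + z
    y_test = (1 - eps) * y - z
    return y_train, y_test

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, size=50)   # toy small-area means
var = rng.uniform(0.5, 2.0, size=50)    # known design-based sampling variances
y = rng.normal(theta, np.sqrt(var))     # area-level direct estimates
y_tr, y_te = thin_gaussian(y, var, eps=0.5, rng=rng)
# y_tr / eps is an unbiased training copy of each direct estimate; y_te / (1 - eps) is an
# independent test copy, enabling out-of-sample comparison of fitted Fay--Herriot models.
print(np.corrcoef(y_tr - 0.5 * theta, y_te - 0.5 * theta)[0, 1])  # empirical check: near zero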
In many modern applications, a carefully designed primary study provides individual-level data for interpretable modeling, while summary-level external information is available through black-box, efficient, and nonparametric machine-learning predictions. Although summary-level external information has been studied in the data integration literature, there is limited methodology for leveraging external nonparametric machine-learning predictions to improve statistical inference in the primary study. We propose a general empirical-likelihood framework that incorporates external predictions through moment constraints. An advantage of nonparametric machine-learning prediction is that it induces a rich class of valid moment restrictions that remain robust to covariate shift under a mild overlap condition without requiring explicit density-ratio modeling. We focus on multinomial logistic regression as the primary model and address common data-quality issues in external sources, including coarsened outcomes, partially observed covariates, covariate shift, and heterogeneity in generating mechanisms known as concept shift. We establish large-sample properties of the resulting fused estimator, including consistency and asymptotic normality under regularity conditions. Moreover, we provide mild sufficient conditions under which incorporating external predictions delivers a strict efficiency gain relative to the primary-only estimator. Simulation studies and an application to the National Health and Nutrition Examination Survey on multiclass blood-pressure classification illustrate the proposed framework.
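In generic form (illustrative notation, not necessarily the paper's), an empirical-likelihood estimator that incorporates external predictions through a moment constraint can be written as
\[
  \hat{\theta}_{\mathrm{EL}}
  = \arg\max_{\theta} \; \max_{w_1,\dots,w_n} \; \sum_{i=1}^{n} \log w_i
  \quad \text{subject to} \quad
  w_i \ge 0, \;\; \sum_{i=1}^{n} w_i = 1, \;\; \sum_{i=1}^{n} w_i \, g(X_i;\theta) = 0,
\]
where the $X_i$ are primary-study observations, $\theta$ indexes the multinomial logistic model, and $g(\cdot;\theta)$ is a moment function built from the external machine-learning predictions (for example, a calibration-type restriction on predicted class probabilities).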
This work reconciles two perspectives on the Elo ranking that coexist in the literature: the practitioner's view as a heuristic feedback rule, and the statistician's view as online maximum likelihood estimation via stochastic gradient ascent. The two perspectives coincide exactly in the binary case if and only if the expected score is the logistic function. However, estimation noise forces a principled decoupling between the model used for ranking and the model used for prediction: the effective scale and home-field advantage parameters must be adjusted to account for the noise. We provide both closed-form corrections and a data-driven identification procedure. For multilevel outcomes, an exact relationship exists when outcome scores are uniformly spaced, but approximations are preferred in general: they account for estimation noise and better fit the data. The decoupled approach substantially outperforms the conventional one that reuses the ranking model for prediction, and serves as a diagnostic of convergence status. Applied to six years of FIFA men's ranking, we find that the ranking had not converged for the vast majority of national teams. The paper is written in a semi-tutorial style accessible to practitioners, with all key results accompanied by closed-form expressions and numerical examples.
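A minimal sketch of the binary case discussed above, where the Elo update coincides with one stochastic-gradient-ascent step on the Bernoulli log-likelihood under a logistic expected score; the conventional scale of 400, the K-factor, and the toy results are illustrative choices, not values from the paper.

def expected_score(r_a, r_b, scale=400.0):
    # Logistic expected score: P(A beats B) under the Elo model.
    return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / scale))

def elo_update(r_a, r_b, score_a, k=32.0, scale=400.0):
    # One Elo step; equivalently one stochastic-gradient-ascent step on the
    # Bernoulli log-likelihood, whose gradient is (observed score - expected score).
    e_a = expected_score(r_a, r_b, scale)
    return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

# Toy sequence of results for two players starting from equal ratings.
ra, rb = 1500.0, 1500.0
for outcome in [1, 1, 0, 1]:  # 1 = player A wins, 0 = player B wins
    ra, rb = elo_update(ra, rb, outcome)
print(round(ra, 1), round(rb, 1))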
This paper introduces a novel measure to quantify the directional dependence of extreme events between two variables. The proposed approach is designed to capture asymmetric tail dependence by studying conditional tail expectations of rank-transformed variables, thereby quantifying the behavior of one variable when the other takes extreme values. We investigate the theoretical asymptotic behavior of the associated estimator. The effectiveness of the approach is demonstrated through an extensive simulation study. In addition, we discuss the use of the proposed coefficient for the detection of causal effects in extreme events. Finally, we apply the method to an oceanographic dataset, where the results highlight the strong asymmetric nature of extreme events and identify the dominant directions of extremal influence among key oceanographic variables. As a directional measure of tail dependence, our approach provides a natural tool for exploring causal-effect relationships in extreme-value settings.
This paper investigates the impact of carbon pricing under the EU Emissions Trading System (EU ETS) on the Italian electricity market, focusing on the carbon cost pass-through rate (CPTR) across market zones during Phases 3 and 4 (2016-2024). Using daily data, the study applies an econometric framework based on a linear regression model with autoregressive dynamics to estimate the extent to which carbon costs are reflected in wholesale electricity prices. It further incorporates robustness checks and quantile regression to assess how the CPTR varies across different fuel spread levels. The results show that carbon costs are positively and significantly transmitted to electricity prices, confirming the relevance of carbon pricing as a key market driver. However, pass-through is incomplete, with CPTR values consistently below 100%. At the national level, the CPTR remains relatively stable at around 30% across the two phases. Substantial heterogeneity emerges across market zones: pass-through increases in the North, Centre-North, and Sardinia during Phase 4, while it declines in the Centre-South and Sicily, reflecting differences in generation mix, carbon intensity, and market conditions. Overall, the findings highlight the importance of market-zone factors in shaping the effectiveness of carbon pricing in electricity markets.
In this paper, we consider the academic department ranking system of Italy, which is based on a performance index named Indice Standardizzato di Performance Dipartimentale (ISPD). While the ISPD has been criticized for its marked tendency toward polarization, we formalize a yet unexplored determinant of this phenomenon: the presence of within-department homogeneity among the standardized scores used to build the index. We account for this intra-departmental correlation by modeling it as a function of department size. The proposed model, estimated via maximum likelihood, makes it possible to build a fairer ranking procedure via a properly adjusted version of the ISPD. The estimation framework is also adapted to fit publicly available data, which are coarsened by rounding and/or left-truncated. To this end, a novel probability distribution, termed Betoidal, is introduced. Empirical evidence in favor of the proposed model is found in the 2017 and 2022 data. Moreover, a simulation study shows that the adjusted index significantly outperforms not only the original ISPD but also other, more data-demanding competing proposals.
We study high-dimensional mediation analysis in which exposures, mediators, and outcomes are all multivariate, and both exposures and mediators may be high-dimensional. We formalize this as a many (exposures)-to-many (mediators)-to-many (outcomes) (MMM) mediation analysis problem. Methodologically, MMM mediation analysis simultaneously performs variable selection for high-dimensional exposures and mediators, estimates the indirect effect matrices (i.e., the coefficient matrices linking the exposure-to-mediator and mediator-to-outcome pathways), and enables prediction of multivariate outcomes. Theoretically, we show that the estimated indirect effect matrices are consistent and element-wise asymptotically normal, and we derive error bounds for the estimators. To evaluate the efficacy of the MMM mediation framework, we first investigate its finite-sample performance, including convergence properties, the behavior of the asymptotic approximations, and robustness to noise, via simulation studies. We then apply MMM mediation analysis to data from the Alzheimer's Disease Neuroimaging Initiative to study how cortical thickness of 202 brain regions may mediate the effects of 688 genome-wide significant single nucleotide polymorphisms (SNPs) (selected from approximately 1.5 million SNPs) on eleven cognitive-behavioral and diagnostic outcomes. The MMM mediation framework identifies biologically interpretable, many-to-many-to-many genetic-neural-cognitive pathways and improves downstream out-of-sample classification and prediction performance. Taken together, our results demonstrate the potential of MMM mediation analysis and highlight the value of statistical methodology for investigating complex, high-dimensional multi-layer pathways in science. The MMM package is available at https://github.com/THELabTop/MMM-Mediation.
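For intuition, a generic linear many-to-many-to-many formulation of the pathway structure described above can be written as follows; the notation is illustrative, and the MMM model adds variable selection and further structure beyond this sketch.
\[
  M = X A + E_M, \qquad Y = M B + X C + E_Y,
\]
with $X \in \mathbb{R}^{n \times p}$ (exposures), $M \in \mathbb{R}^{n \times q}$ (mediators), and $Y \in \mathbb{R}^{n \times r}$ (outcomes). The indirect effect of the exposures on the outcomes transmitted through the mediators is then the $p \times r$ product $A B$, whose entries correspond to exposure-mediator-outcome pathways.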
We introduce a family of scale-invariant entropy statistics derived from logarithmically aggregated distance distributions of point processes, with prime numbers serving as a motivating example. The construction associates to each finite configuration a scalar quantity encoding structural features of relative spacing while remaining insensitive to absolute scale. This work is intended as a methodological contribution rather than a source of new raw results.
Latent space models are widely used in statistical network analysis and are often fit by Markov chain Monte Carlo. However, posterior summaries of latent coordinates are not canonical because the likelihood depends only on pairwise distances and is invariant under rigid motions of the latent space. Standard post hoc alignment can aid visualization, but the resulting summaries depend on an arbitrary reference configuration. We propose a quotient-based posterior analysis for Euclidean latent space models using the centered Gram map, which represents identifiable latent structure while removing nonidentifiability. This yields intrinsic posterior summaries of mean structure and uncertainty that can be computed directly from posterior samples, together with basic theoretical guarantees including canonicality, existence, and stability. Through simulations and analyses of the Florentine marriage network and a statisticians' coauthorship network, the proposed framework clarifies when alignment-based summaries are stable, when they become reference-sensitive, and which nodes or relationships are weakly identified. These results show how coherent posterior analysis can reveal latent relational structure beyond a single embedding.
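A minimal sketch of the centered Gram map on posterior draws of latent coordinates: the map is invariant to rotations, reflections, and translations of each draw, so Gram-based posterior summaries need no Procrustes alignment. The toy draws below stand in for MCMC output and are illustrative.

import numpy as np

def centered_gram(Z):
    # Centered Gram map: G = C Z Z^T C with C = I - (1/n) 11^T.
    # G is unchanged if Z is rotated, reflected, or translated.
    n = Z.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    Zc = C @ Z
    return Zc @ Zc.T

rng = np.random.default_rng(2)
n_nodes, dim, n_draws = 10, 2, 200
draws = rng.normal(size=(n_draws, n_nodes, dim))  # stand-in for posterior draws of latent positions

# A rigid motion changes the coordinates but not the Gram representation.
theta = rng.uniform(0.0, 2.0 * np.pi)
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
print(np.allclose(centered_gram(draws[0]), centered_gram(draws[0] @ R + 3.0)))  # True

# Intrinsic posterior mean of the identifiable structure, computed directly from samples.
G_mean = np.mean([centered_gram(Z) for Z in draws], axis=0)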
We develop Wasserstein-based hypothesis tests for empirical-measure convergence in stationary dependent sequences. For a known candidate invariant measure $\mu$, we study the statistic $T_n=\sqrt{n}\,W_1(\hat{\mu}_n,\mu)$ and establish asymptotic level-$\alpha$ validity under the null, together with consistency under fixed alternatives. When the invariant measure is unknown, we derive the asymptotic law of the pairwise statistic $\sqrt{n}\,W_1(\hat{\mu}_n^{(i)},\hat{\mu}_n^{(j)})$ for independent trajectories and obtain a corresponding pairwise test, including Bonferroni control for multiple comparisons. Simulation experiments in linear and nonlinear dynamical settings illustrate both the coverage and the power of the tests.
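A minimal one-dimensional sketch of the statistic $T_n=\sqrt{n}\,W_1(\hat{\mu}_n,\mu)$ with a known reference measure; here $\mu$ is approximated by a large reference sample and the null law is calibrated by Monte Carlo under i.i.d. sampling, which are illustrative shortcuts (the paper's calibration for dependent sequences would differ).

import numpy as np
from scipy.stats import wasserstein_distance

def w1_stat(x, mu_sample):
    # T_n = sqrt(n) * W_1(empirical measure of x, mu), with mu approximated
    # by a large reference sample (an illustrative shortcut).
    return np.sqrt(len(x)) * wasserstein_distance(x, mu_sample)

rng = np.random.default_rng(3)
mu_ref = rng.standard_normal(100_000)   # stand-in for a known invariant measure N(0, 1)

# Monte Carlo null distribution under i.i.d. sampling from mu.
null = np.array([w1_stat(rng.standard_normal(500), mu_ref) for _ in range(500)])
crit = np.quantile(null, 0.95)          # level-0.05 critical value

x_alt = rng.standard_normal(500) + 0.2  # a mean-shifted alternative
print(w1_stat(x_alt, mu_ref) > crit)    # typically True: the test rejects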
Clinical evidence synthesis requires identifying relevant trials from large registries and aggregating results that account for population differences. While recent LLM-based approaches have automated components of systematic review, they do not support end-to-end evidence synthesis. Moreover, conventional meta-analysis weights studies by statistical precision without considering clinical compatibility reflected in eligibility criteria. We propose EligMeta, an agentic framework that integrates automated trial discovery with eligibility-aware meta-analysis, translating natural-language queries into reproducible trial selection and incorporating eligibility alignment into study weighting to produce cohort-specific pooled estimates. EligMeta employs a hybrid architecture separating LLM-based reasoning from deterministic execution: LLMs generate interpretable rules from natural-language queries and perform schema-constrained parsing of trial metadata, while all logical operations, weight computations, and statistical pooling are executed deterministically to ensure reproducibility. The framework structures eligibility criteria and computes similarity-based study weights reflecting population alignment between target and comparator trials. In a gastric cancer landscape analysis, EligMeta reduced 4,044 candidate trials to 39 clinically relevant studies through rule-based filtering, recovering all 13 guideline-cited trials. In an olaparib adverse events meta-analysis across four trials, eligibility-aware weighting shifted the pooled risk ratio from 2.18 (95% CI: 1.71-2.79) under conventional Mantel-Haenszel estimation to 1.97 (95% CI: 1.76-2.20), demonstrating quantifiable impact of incorporating eligibility alignment. EligMeta bridges automated trial discovery with eligibility-aware meta-analysis, providing a scalable and reproducible framework for evidence synthesis in precision medicine.
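To make the pooling step concrete, the sketch below computes a conventional Mantel-Haenszel pooled risk ratio and, for contrast, a generic similarity-scaled inverse-variance pooled risk ratio; the weighting scheme and all counts and similarity scores are illustrative assumptions, not EligMeta's exact estimator or the paper's data.

import numpy as np

def mantel_haenszel_rr(events_t, n_t, events_c, n_c):
    # Conventional Mantel-Haenszel pooled risk ratio across studies.
    events_t, n_t = np.asarray(events_t, float), np.asarray(n_t, float)
    events_c, n_c = np.asarray(events_c, float), np.asarray(n_c, float)
    N = n_t + n_c
    return np.sum(events_t * n_c / N) / np.sum(events_c * n_t / N)

def similarity_weighted_rr(events_t, n_t, events_c, n_c, sim):
    # A generic eligibility-similarity-scaled inverse-variance pooled risk ratio
    # (illustrative only; not EligMeta's exact weighting scheme).
    events_t, n_t = np.asarray(events_t, float), np.asarray(n_t, float)
    events_c, n_c = np.asarray(events_c, float), np.asarray(n_c, float)
    log_rr = np.log((events_t / n_t) / (events_c / n_c))
    var = 1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c  # standard log-RR variance
    w = np.asarray(sim, float) / var
    return np.exp(np.sum(w * log_rr) / np.sum(w))

# Toy two-arm counts for four trials and toy eligibility-similarity scores in [0, 1].
et, nt = [30, 25, 40, 12], [100, 120, 150, 60]
ec, nc = [12, 14, 20, 6], [100, 118, 148, 62]
sim = [0.9, 0.4, 0.8, 0.2]
print(mantel_haenszel_rr(et, nt, ec, nc), similarity_weighted_rr(et, nt, ec, nc, sim))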
This paper provides a statistical analysis of three common methods of regression for Poisson data in the presence of Poisson background, namely the joint fit with two parametric models for the source and the background, the use of a non-parametric model for the background known as the wstat method, and the regression with a fixed background. The non-parametric background method, which is a popular method for spectral data, is found to be significantly biased, especially in the low-count and background-dominated regimes. Similar conclusions apply to the fixed-background regression. The joint-fit method, on the other hand, simultaneously affords reliable hypothesis testing by means of the usual Cash statistic and unbiased reconstruction of source parameters. We also investigate the effect of non-parametric regression on the number of effective degrees of freedom by means of the Efron degree of freedom function. We find that the wstat method adds a significantly larger number of degrees of freedom, compared to the number of free parameters in the source model. The other two methods have a number of degrees of freedom consistent with the number of adjustable parameters, at least for the simple models investigated in this paper.
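For reference, the Cash statistic used for hypothesis testing in the joint-fit method is commonly written in the following form (with $d_i$ the observed counts and $m_i$ the model counts in bin $i$; terms with $d_i = 0$ reduce to $2 m_i$):
\[
  C \;=\; 2 \sum_{i} \Bigl[\, m_i - d_i + d_i \ln\!\frac{d_i}{m_i} \,\Bigr].
\]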
Statistical methods are indispensable to scientific inference. However, there exists a longstanding tension across a wide range of scientific disciplines about the role that ``context'' should play in the application of statistical methods and the interpretation of statistical results. Though frequently invoked, the notion of ``scientific context'' refers to at least two distinct concepts: a set of foundational, nuanced, and elusive background assumptions and substantive features of a given area of study that shape the validity and reliability of statistical methods; and more quantifiable contextual issues that affect the performance of statistical methods and the interpretation of statistical results. I argue here that the application and interpretation of statistical methods require careful consideration of foundational contextual issues. To motivate the arguments, I review a recent re-formulation of the $p$-value as a measure of divergence between an observed dataset and a set of assumptions used to construct statistical measures. I use this framework to illustrate the role that context plays in two randomized trials: one on low-dose aspirin for pregnancy loss, and one on a new inhibitor of a key biochemical pathway affecting ankylosing spondylitis. Finally, I note that the adoption of low significance thresholds in genome-wide association studies and high-energy particle physics has succeeded more because of the extensive validity-checking gauntlets and contextual considerations that have accompanied these low thresholds than because of the low thresholds themselves. I use these illustrations and arguments to suggest that (i) the adoption of a universal threshold for significance testing should be abandoned as a goal of statistics reform; and (ii) the validity and optimal use of applied statistical tools require careful consideration of nuanced scientific context.
Large-scale online platforms and marketplace systems often evaluate new policies through experiments that randomize treatment across operational units (e.g., geographies, regions, or clusters) over many time periods. In these settings, standard A/B testing can be inefficient or unreliable due to a limited number of units, substantial cross-unit heterogeneity, non-stationarity, and potential carryover across periods. We propose Sequentially-Rerandomized Switchback Experiments (SRSB), a new experimental design that helps mitigate these challenges. SRSB re-randomizes treatment at each time period so as to enforce balance on pre-specified prognostic variables constructed from past observations. In the absence of carryover, SRSB improves precision by leveraging temporal dependence through balancing lagged outcomes and covariates; we develop finite-sample randomization inference under a sharp null as well as asymptotic inference as the number of periods grows. We then extend SRSB to settings with first-order carryover and introduce a blocked SRSB variant that rerandomizes within strata defined by the previous treatment to form stable and comparable "stay" groups. Extensive simulations demonstrate the practical gains and robustness of SRSB relative to standard switchback designs.
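A generic sketch of the per-period rerandomization step described above: assignments are redrawn until a Mahalanobis balance criterion on prognostic variables (for example, lagged outcomes and covariates) falls below a threshold. The acceptance threshold, data, and function names are illustrative, and this sketch omits the carryover blocking of the full SRSB design.

import numpy as np

def rerandomize_period(prognostic, n_treat, threshold=0.5, max_draws=10_000, rng=None):
    # Redraw unit-level assignments for one period until treated and control groups
    # are balanced (small Mahalanobis distance) on the prognostic variables.
    if rng is None:
        rng = np.random.default_rng()
    n, _ = prognostic.shape
    S_inv = np.linalg.pinv(np.cov(prognostic, rowvar=False))
    for _ in range(max_draws):
        z = np.zeros(n, dtype=bool)
        z[rng.choice(n, size=n_treat, replace=False)] = True
        diff = prognostic[z].mean(axis=0) - prognostic[~z].mean(axis=0)
        if diff @ S_inv @ diff < threshold:
            return z
    return z  # fall back to the last draw if no acceptable assignment is found

rng = np.random.default_rng(4)
prognostic = rng.normal(size=(20, 3))  # e.g., 20 regions x 3 lagged metrics from past periods
z_t = rerandomize_period(prognostic, n_treat=10, rng=rng)
print(z_t.sum(), "treated units this period")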
Microbial interaction networks can rewire in response to host and environmental factors, yet most existing methods for network estimation treat the covariance structure as static across samples. We propose TRECOR, a Bayesian covariance regression framework for inferring covariate-dependent microbial covariation networks from zero-inflated compositional count data. The method models microbiome counts through a latent multivariate normal distribution defined on the internal nodes of a phylogenetic tree, where both the mean and covariance of the latent variables depend on covariates. The covariance is decomposed into a sparse baseline component, representing a stable microbial covariation network, and a low-rank covariate-dependent perturbation that captures network rewiring. By exploiting the binomial factorization of the multinomial distribution under the logistic-tree-normal representation, the model achieves full conjugacy and posterior inference proceeds via an efficient Gibbs sampler. In simulations, TRECOR substantially outperforms covariance regression applied to transformed counts, demonstrating the importance of explicitly modeling the compositional sampling layer. Applied to gut microbiome data from 531 individuals across three countries, we find that age has the largest effect on microbial covariation, a pattern not revealed by mean-based analysis alone. The age-associated differential network is enriched for Enterobacteriaceae and related families, consistent with known developmental shifts in the gut microbiota, while country-associated differential networks implicate diet-related taxa.
Probabilistic forecasts must sum to unity and cannot express ``I don't know.'' Possibility theory relaxes this constraint: a subnormal distribution explicitly measures how much of the plausibility budget remains unassigned, an ignorance signal that probability cannot represent. This paper develops a verification framework for such forecasts, centred on a five-number scorecard that separately diagnoses whether the forecast pointed at the right outcome (depth-of-truth), how sharply (diffuseness, support margin), how confidently (ignorance), and how dominantly (conditional necessity). A possibility-to-probability conversion preserves ignorance for familiar frequency-based scoring; categorical threshold scores (POD, FAR, CSI, etc.) connect to operational practice. Together, these three complementary facets -- possibilistic, probabilistic, and categorical -- expose failure modes invisible to any single metric. Storm Prediction Center convective outlook categories serve as the running example throughout; a synthetic reforecast demonstrates diagnostic visualisations and scorecard interpretation. Ignorance is better expressed than repressed.
We study how sampling geometry contributes to uncertainty in modeling spatial geophysical observations as sampled random fields characterized by stationary, isotropic, parametric covariance functions. We incorporate the signature of discrete spatial sampling patterns into an asymptotically unbiased spectral maximum-likelihood estimation method along with analytical uncertainty calculation. We illustrate the broad applicability of our modeling through synthetic and real data examples with sampling patterns that include irregularly bounded contiguous region(s) of interest, structured sweeps of instrumental measurements, and missing observations dispersed across the domain of a field, among which contiguous patches are generally favorable. We find through asymptotic studies that allocating samples following a growing-domain strategy rather than a densifying, infill scheme best reduces estimator bias and (co)variance, whether the field has been sampled regularly or not. Because modeling assumptions also shape how well an observed random field can be characterized, we study the effect of covariance parameters assumed a priori. We demonstrate the desirable behavior of the general Matérn class and show how to interrogate goodness-of-fit criteria to detect departures from the null hypothesis of Gaussianity, stationarity, and isotropy.
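For reference, the Matérn covariance class mentioned above is, in one common parameterization (with variance $\sigma^2$, range $\rho$, smoothness $\nu$, and $K_\nu$ the modified Bessel function of the second kind),
\[
  C(h) \;=\; \sigma^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)}
  \left( \frac{\sqrt{2\nu}\, h}{\rho} \right)^{\!\nu}
  K_{\nu}\!\left( \frac{\sqrt{2\nu}\, h}{\rho} \right),
  \qquad h \ge 0,
\]
which interpolates between the exponential covariance ($\nu = 1/2$) and the squared-exponential limit ($\nu \to \infty$).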