Data Analysis, Statistics and Probability
Methods, software and hardware for physics data analysis.
This chapter gives an overview of the core concepts of machine learning (ML) -- the use of algorithms that learn from data, identify patterns, and make predictions or decisions without being explicitly programmed -- that are relevant to particle physics with some examples of applications to the energy, intensity, cosmic, and accelerator frontiers.
We present a proof-of-principle study demonstrating the use of large language model (LLM) agents to automate a representative high energy physics (HEP) analysis. Using the Higgs boson diphoton cross-section measurement as a case study with ATLAS Open Data, we design a hybrid system that combines an LLM-based supervisor-coder agent with the Snakemake workflow manager. In this architecture, the workflow manager enforces reproducibility and determinism, while the agent autonomously generates, executes, and iteratively corrects analysis code in response to user instructions. We define quantitative evaluation metrics including success rate, error distribution, costs per specific task, and average number of API calls, to assess agent performance across multi-stage workflows. To characterize variability across architectures, we benchmark a representative selection of state-of-the-art LLMs spanning the Gemini and GPT-5 series, the Claude family, and leading open-weight models. While the workflow manager ensures deterministic execution of all analysis steps, the final outputs still show stochastic variation. Although we set the temperature to zero, other sampling parameters (e.g., top-p, top-k) remained at their defaults, and some reasoning-oriented models internally adjust these settings. Consequently, the models do not produce fully deterministic results. This study establishes the first LLM-agent-driven automated data-analysis framework in HEP, enabling systematic benchmarking of model capabilities, stability, and limitations in real-world scientific computing environments. The baseline code used in this work is available at https://huggingface.co/HWresearch/LLM4HEP. This work was accepted as a poster at the Machine Learning and the Physical Sciences (ML4PS) workshop at NeurIPS 2025. The initial submission was made on August 30, 2025.
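As an illustration of the supervisor-coder loop described above, the following minimal sketch shows how an agent-generated script can be executed under Snakemake, with error logs fed back for iterative correction. The function llm_generate and the script/target names are hypothetical placeholders, not the framework released at the URL above.

import pathlib
import subprocess

def llm_generate(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat-completion API; returns Python source."""
    raise NotImplementedError  # replace with the provider SDK of your choice

def run_step(task: str, script_path: str, target: str, max_retries: int = 3) -> bool:
    """Supervisor-coder loop: generate a script, run it via Snakemake, retry on failure."""
    prompt = f"Write a Python script for this HEP analysis step: {task}"
    for _ in range(max_retries):
        pathlib.Path(script_path).write_text(llm_generate(prompt))
        # The workflow manager enforces deterministic, reproducible execution of the step.
        result = subprocess.run(["snakemake", "--cores", "1", target],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Feed the error log back to the agent for iterative self-correction.
        prompt += f"\nThe previous attempt failed with:\n{result.stderr[-2000:]}\nFix the script."
    return False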
Understanding how systems evolve over time often requires discovering the differential equations that govern their behavior. Automatically learning these equations from experimental data is challenging when the data are noisy or limited, and existing approaches struggle, in particular, with the estimation of unobserved derivatives. Here, we introduce an integral Bayesian symbolic regression method that learns governing equations directly from raw time-series data, without requiring manual assumptions or error-prone derivative estimation. By sampling the space of symbolic differential equations and evaluating them via numerical integration, our method robustly identifies governing equations even from noisy or scarce data. We show that this approach accurately recovers ground-truth models in synthetic benchmarks, and that it makes quasi-optimal predictions of system dynamics for all noise regimes. Applying this method to bacterial growth experiments across multiple species and substrates, we discover novel growth equations that outperform classical models in accurately capturing all phases of microbial proliferation, including lag, exponential, and saturation. Unlike standard approaches, our method reveals subtle shifts in growth dynamics, such as double ramp-ups or non-canonical transitions, offering a deeper, data-driven understanding of microbial physiology.
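As a rough sketch of the integrate-then-score idea (not the authors' Bayesian sampler over symbolic expressions), the snippet below evaluates one candidate differential equation by numerical integration against raw time-series data; the logistic candidate and the Gaussian noise model are illustrative assumptions.

import numpy as np
from scipy.integrate import solve_ivp

def candidate_rhs(t, y, r, K):
    # Illustrative candidate dy/dt = f(y; theta); in the method, f is sampled symbolically.
    return r * y * (1.0 - y / K)

def log_likelihood(theta, t_obs, y_obs, sigma):
    """Score a candidate ODE by integrating it and comparing to the raw time series."""
    r, K, y0 = theta
    sol = solve_ivp(candidate_rhs, (t_obs[0], t_obs[-1]), [y0],
                    t_eval=t_obs, args=(r, K), rtol=1e-6)
    if not sol.success or sol.y.shape[1] != len(t_obs):
        return -np.inf  # penalize candidates that cannot be integrated over the data range
    resid = y_obs - sol.y[0]
    return -0.5 * np.sum((resid / sigma) ** 2)  # Gaussian measurement-noise model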
Quantifying numerical data involves addressing two key challenges: first, determining whether the data can be naturally quantified, and second, identifying the numerical intervals or ranges of values that correspond to specific value classes, referred to as "quantums," which represent statistically meaningful states. If such quantification is feasible, continuous streams of numerical data can be transformed into sequences of "symbols" that reflect the states of the system described by the measured parameter. People often perform this task intuitively, relying on common sense or practical experience, while information theory and computer science offer computable metrics for this purpose. In this study, we assess the applicability of metrics based on information compression and the Silhouette coefficient for quantifying numerical data. We also investigate the extent to which these metrics correlate with one another and with what is commonly referred to as "human intuition." Our findings suggest that the ability to classify numeric data values into distinct categories is associated with a Silhouette coefficient above 0.65 and a Dip Test value below 0.5; otherwise, the data can be treated as following a unimodal normal distribution. Furthermore, when quantification is possible, the Silhouette coefficient appears to align more closely with human intuition than the "normalized centroid distance" method derived from the information-compression perspective.
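A minimal sketch of the resulting decision rule, assuming the scikit-learn and third-party diptest packages; whether "Dip Test below 0.5" refers to the dip statistic or its p-value is our interpretation, flagged in the comment.

import numpy as np
import diptest                                   # third-party package (assumption)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def is_quantizable(values, n_classes=2):
    """Return True if the 1-D data support a split into distinct value classes."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(x)
    sil = silhouette_score(x, labels)
    dip_stat, dip_pval = diptest.diptest(x.ravel())
    # The rule above: silhouette > 0.65 and "Dip Test below 0.5"; we read the latter
    # as the dip-test p-value (our interpretation).
    return sil > 0.65 and dip_pval < 0.5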
Accurate determination of the neutrino energy is central to precision oscillation measurements. In this work, we introduce the W$^2$-based estimator, a new neutrino energy estimator based on the measurement of the final-state hadronic invariant mass. This estimator is particularly designed to be employed in liquid-argon time-projection chambers exposed to broadband beams that span the challenging transition region between shallow inelastic scattering and deep inelastic scattering. The performance of the W$^2$-based estimator is compared against four other commonly used estimators. The impact of the estimator choice is evaluated by performing measurements of $\delta_{CP}$ and $\Delta m^2_{23}$ in a toy long-baseline oscillation analysis. We find that the W$^2$-based estimator shows the smallest bias as a function of true neutrino energy and is particularly stable against mismodelling of the lepton scattering angle, missing energy, hadronic invariant mass, and final-state interactions. Such an inclusive channel complements the strengths of more exclusive methods that optimize the energy resolution. By providing a detailed analysis of the strengths, weaknesses, and domain of applicability of each estimator, this work informs the combined use of energy estimators in any future LArTPC-based oscillation analysis.
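For orientation, the sketch below gives a textbook reconstruction of the neutrino energy from the charged-lepton kinematics and the hadronic invariant mass W, assuming a stationary target nucleon; the paper's W$^2$-based estimator may differ in its exact definition and corrections.

M_N = 0.9389  # average nucleon mass [GeV]

def enu_from_w2(E_lep, p_lep, cos_theta, W2, m_lep=0.10566):
    """Neutrino energy [GeV] from W^2 = (p_N + q)^2 with the struck nucleon at rest."""
    num = W2 - M_N**2 - m_lep**2 + 2.0 * M_N * E_lep
    den = 2.0 * (M_N - E_lep + p_lep * cos_theta)
    return num / den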
Robert Cousins has posted a comment on my manuscript on ``Confidence intervals for the Poisson distribution''. His key point is that one should not include in the likelihood non-physical parameter values, even for frequency statistics. This is my response, in which I contend that it can be useful to do so when discussing such descriptive statistics.
Understanding the temporal dependence of precipitation is key to improving weather predictability and developing efficient stochastic rainfall models. We introduce an information-theoretic approach to quantify memory effects in discrete stochastic processes and apply it to daily precipitation records across the contiguous United States. The method is based on the predictability gain, a quantity derived from block entropy that measures the additional information provided by higher-order temporal dependencies. This statistic, combined with a bootstrap-based hypothesis testing and Fisher's method, enables a robust memory estimator from finite data. Tests with generated sequences show that this estimator outperforms other model-selection criteria such as AIC and BIC. Applied to precipitation data, the analysis reveals that daily rainfall occurrence is well described by low-order Markov chains, exhibiting regional and seasonal variations, with stronger correlations in winter along the West Coast and in summer in the Southeast, consistent with known climatological patterns. Overall, our findings establish a framework for building parsimonious stochastic descriptions, useful when addressing spatial heterogeneity in the memory structure of precipitation dynamics, and support further advances in real-time, data-driven forecasting schemes.
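A minimal sketch of the block-entropy machinery, with one plausible reading of the "predictability gain"; the bootstrap hypothesis test and Fisher combination described above are omitted.

import numpy as np
from collections import Counter

def block_entropy(seq, k):
    """Shannon entropy (bits) of overlapping length-k blocks of a discrete sequence."""
    blocks = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    counts = np.array(list(Counter(blocks).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def predictability_gain(seq, k):
    """Extra information (bits/symbol) from conditioning on k past symbols instead of
    k-1, i.e. h_{k-1} - h_k with h_k = H_{k+1} - H_k (one reading of the abstract)."""
    h = lambda m: block_entropy(seq, m + 1) - block_entropy(seq, m)
    return h(k - 1) - h(k)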
We derive analytic formulas to reconstruct particle-averaged quantities from experimental results that suffer from the efficiency loss of particle measurements. These formulas are derived under the assumption that the probabilities of observing individual particles are independent. The formulas do not agree with the conventionally used intuitive formulas.
Small-Angle Neutron Scattering (SANS) data analysis often relies on fixed-width binning schemes that overlook variations in signal strength and structural complexity. We introduce a statistically grounded approach based on the Freedman-Diaconis (FD) rule, which minimizes the mean integrated squared error between the histogram estimate and the true intensity distribution. By deriving the competing scaling relations for counting noise ($\propto h^{-1}$) and binning distortion ($\propto h^{2}$), we establish an optimal bin width that balances statistical precision and structural resolution. Application to synthetic data from the Debye scattering function of a Gaussian polymer chain demonstrates that the FD criterion quantitatively determines the most efficient binning, faithfully reproducing the curvature of $I(Q)$ while minimizing random error. The optimal width follows the expected scaling $h_{\mathrm{opt}} \propto N_{\mathrm{total}}^{-1/3}$, delineating the transition between noise- and resolution-limited regimes. This framework provides a unified, physics-informed basis for adaptive, statistically efficient binning in neutron scattering experiments.
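For reference, a minimal sketch of the Freedman-Diaconis rule, h = 2 * IQR * n**(-1/3), applied to a one-dimensional sample; NumPy exposes the same rule directly, and applying it to binned SANS intensities as above follows the same formula.

import numpy as np

def fd_bin_width(x):
    """Freedman-Diaconis bin width h = 2 * IQR * n**(-1/3)."""
    x = np.asarray(x, dtype=float)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 2.0 * iqr * x.size ** (-1.0 / 3.0)

# NumPy implements the same rule directly:
# edges = np.histogram_bin_edges(x, bins="fd")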
In a suitable limit, the spectral fluctuations of complex quantum systems can be described within the framework of random matrix theory. We numerically study higher-order spacing statistics in superpositions of $m$ spectra of circular random matrices. We tabulate the modified Dyson index $\beta'$ for given $m$, $k$, and $\beta$ such that the nearest-neighbor spacing distribution at $\beta'$ coincides with the $k$-th order spacing distribution corresponding to $\beta$ and $m$. We conjecture that, for given $m$ ($k$) and $\beta$, the resulting sequence of $\beta'$ as a function of $k$ ($m$) is unique. This result can serve as a tool for characterizing a system and determining its symmetry structure without desymmetrizing the spectra. We verify the $m=2$ COE case with the quantum kicked top model for various Hilbert space dimensions. From a comparative study of higher-order spacings and ratios in the $m=1$ and $m=2$ cases of the COE and GOE, varying the dimension at a fixed number of realizations and vice versa, we find that the COE and GOE share the same asymptotic behavior for a given higher-order statistic. Within a given ensemble (COE or GOE), however, the spacings and ratios agree with each other only up to some low $k$, beyond which they begin to deviate. For $k=1$, the convergence toward the Poisson distribution with increasing $m$ at fixed $\beta$ is faster for ratios than for the corresponding spacings. Finally, we study the spectral fluctuations of the intermediate map for various dimensions and find that the random numbers used to generate the map's matrix leave an imprint on the higher-order statistics.
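A minimal sketch of the numerical experiment, assuming SciPy's unitary_group for Haar sampling; the matrix dimension and number of realizations are illustrative.

import numpy as np
from scipy.stats import unitary_group

def coe_eigenphases(n):
    """Sorted eigenphases of a COE matrix S = U^T U, with U Haar-distributed unitary."""
    u = unitary_group.rvs(n)
    s = u.T @ u
    return np.sort(np.angle(np.linalg.eigvals(s)))

def kth_order_spacings(m=2, k=2, n=100, realizations=200):
    """Unit-mean k-th order spacings of m superposed COE spectra (the mean eigenphase
    density is uniform on the circle, so no further unfolding is needed)."""
    out = []
    for _ in range(realizations):
        phases = np.sort(np.concatenate([coe_eigenphases(n) for _ in range(m)]))
        s = phases[k:] - phases[:-k]
        out.append(s / s.mean())
    return np.concatenate(out)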
A novel numerical technique is presented to transform one random variable within a system toward statistical quasi-independence from any other random variable in the system. The method's applicability is demonstrated through a particle physics example where a classifier is rendered quasi-independent from an observable quantity.
A frequently occurring challenge in experimental and numerical work is how to resolve features, such as spectral peaks (center, width, height) and derivatives, from measured data with unavoidable noise. We therefore develop a modified Whittaker-Henderson smoothing procedure that balances the spectral features against the noise. In our procedure, we introduce adjustable weights that are optimized using cross-validation. When the measurement errors are known, a straightforward error analysis of the smoothed results is feasible. As an example, we calculate the optical group delay dispersion of a Bragg reflector from synthetic phase data with noise to illustrate the effectiveness of the smoothing algorithm. The smoother faithfully reconstructs the group delay dispersion, allowing us to observe details that otherwise remain buried in noise. To further illustrate the power of our smoother, we study several commonly occurring difficulties in data and data analysis, and show how to properly smooth unequally sampled data, how to handle discontinuities, including discontinuous derivatives (kinks), and how to properly smooth data in the vicinity of domain boundaries.
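A minimal sketch of a weighted Whittaker-Henderson smoother of the kind described above; the cross-validated optimization of the weights and the handling of unequal sampling, kinks, and boundaries are not reproduced here.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def whittaker_smooth(y, w, lam, order=2):
    """Minimize sum_i w_i (y_i - z_i)^2 + lam * sum_j ((D^order z)_j)^2 over z."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = y.size
    D = sp.eye(n, format="csr")
    for _ in range(order):
        D = D[1:, :] - D[:-1, :]          # repeated first differences
    A = sp.diags(w) + lam * (D.T @ D)
    return spsolve(A.tocsc(), w * y)      # smoothed signal z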
In experimental nuclear and particle physics, the extraction of high-purity samples of rare events critically depends on the efficiency and accuracy of particle identification (PID). In this work, we present a PID method based on physics-informed neural networks, applied to HADES data at the level of fully reconstructed particle track candidates. The results demonstrate a significant improvement in PID performance compared to conventional techniques, highlighting the potential of physics-informed neural networks as a powerful tool for future data analyses.
Artificial intelligence (AI) - and specifically machine learning (ML) - applications for climate prediction across timescales are proliferating quickly. The emergence of these methods prompts a revisit to the impact of data preprocessing, a topic familiar to the climate community, as more traditional statistical models work with relatively small sample sizes. Indeed, the skill and confidence in the forecasts produced by data-driven models are directly influenced by the quality of the datasets and how they are treated during model development, thus yielding the colloquialism, "garbage in, garbage out." As such, this article establishes protocols for the proper preprocessing of input data for AI/ML models designed for climate prediction (i.e., subseasonal to decadal and longer). The three aims are to: (1) educate researchers, developers, and end users on the effects that preprocessing has on climate predictions; (2) provide recommended practices for data preprocessing for such applications; and (3) empower end users to decipher whether the models they are using are properly designed for their objectives. Specific topics covered in this article include the creation of (standardized) anomalies, dealing with non-stationarity and the spatiotemporally correlated nature of climate data, and handling of extreme values and variables with potentially complex distributions. Case studies will illustrate how using different preprocessing techniques can produce different predictions from the same model, which can create confusion and decrease confidence in the overall process. Ultimately, implementing the recommended practices set forth in this article will enhance the robustness and transparency of AI/ML in climate prediction studies.
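As one concrete example of the preprocessing choices discussed above, the sketch below computes standardized day-of-year anomalies using a climatology estimated from the training period only, so that no information leaks from the verification period; the pandas-based interface is illustrative.

import pandas as pd

def standardized_anomalies(series: pd.Series, train_end: str) -> pd.Series:
    """Daily standardized anomalies relative to a day-of-year climatology estimated
    only from data up to train_end (avoids leakage into the verification period)."""
    train = series[:train_end]
    doy = series.index.dayofyear
    clim_mean = train.groupby(train.index.dayofyear).mean()
    clim_std = train.groupby(train.index.dayofyear).std()
    return (series - clim_mean.reindex(doy).to_numpy()) / clim_std.reindex(doy).to_numpy()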
Machine learning (ML) techniques have recently enabled enormous gains in sensitivity to new phenomena across the sciences. In particle physics, much of this progress has relied on excellent simulations of a wide range of physical processes. However, due to the sophistication of modern machine learning algorithms and their reliance on high-quality training samples, discrepancies between simulation and experimental data can significantly limit their effectiveness. In this work, we present a solution to this ``misspecification'' problem: a model calibration approach based on optimal transport, which we apply to high-dimensional simulations for the first time. We demonstrate the performance of our approach through jet tagging, using a dataset inspired by the CMS experiment at the Large Hadron Collider. A 128-dimensional internal jet representation from a powerful general-purpose classifier is studied; after calibrating this internal ``latent'' representation, we find that a wide variety of quantities derived from it for downstream tasks are also properly calibrated: using this calibrated high-dimensional representation, powerful new applications of jet flavor information can be utilized in LHC analyses. This is a key step toward allowing the unbiased use of ``foundation models'' in particle physics. More broadly, this calibration framework can be used to correct high-dimensional simulations across the sciences.
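A minimal sketch of optimal-transport calibration of a latent representation, using the POT library's EMD-based domain adaptation as a simple stand-in for the paper's pipeline; the array shapes and the choice of transport solver are assumptions.

import numpy as np
from ot.da import EMDTransport   # POT: Python Optimal Transport (assumption: available)

def calibrate_latents(z_sim, z_data):
    """Map simulated latent vectors (n_sim, 128) onto the data latent distribution
    (n_data, 128); downstream quantities are then computed from the output."""
    transport = EMDTransport()
    transport.fit(Xs=np.asarray(z_sim), Xt=np.asarray(z_data))
    return transport.transform(Xs=np.asarray(z_sim))   # calibrated simulation latents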
Transfer entropy is a widely used measure for quantifying directed information flows in complex systems. While the challenges of estimating transfer entropy for continuous data are well known, it has two major shortcomings for data of finite cardinality: it exhibits a substantial positive bias for sparse bin counts, and it has no clear means to assess statistical significance. By computing information content in finite data streams without explicitly considering symbols as instances of random variables, we derive a transfer entropy measure which is asymptotically equivalent to the standard plug-in estimator but remedies these issues for time series of small size and/or high cardinality, permitting a fully nonparametric assessment of statistical significance without simulation.
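For comparison, a minimal sketch of the standard plug-in transfer entropy for discrete series with history length one, to which the proposed measure is asymptotically equivalent; the bias remedy and significance assessment themselves are not reproduced here.

import numpy as np
from collections import Counter

def plugin_transfer_entropy(x, y):
    """Plug-in T_{Y->X} in bits, i.e. I(X_{t+1}; Y_t | X_t) with history length 1."""
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))
    n = sum(triples.values())
    p_xyz = {key: c / n for key, c in triples.items()}
    p_xz, p_z, p_yz = Counter(), Counter(), Counter()
    for (x1, x0, y0), p in p_xyz.items():
        p_xz[(x1, x0)] += p
        p_z[x0] += p
        p_yz[(x0, y0)] += p
    te = 0.0
    for (x1, x0, y0), p in p_xyz.items():
        te += p * np.log2(p * p_z[x0] / (p_xz[(x1, x0)] * p_yz[(x0, y0)]))
    return te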
Mutual information (MI) is a fundamental measure of statistical dependence between two variables, yet accurate estimation from finite data remains notoriously difficult. No estimator is universally reliable, and common approaches fail in the high-dimensional, undersampled regimes typical of modern experiments. Recent machine learning-based estimators show promise, but their accuracy depends sensitively on dataset size, structure, and hyperparameters, with no accepted tests to detect failures. We close these gaps through a systematic evaluation of classical and neural MI estimators across standard benchmarks and new synthetic datasets tailored to challenging high-dimensional, undersampled regimes. We contribute: (i) a practical protocol for reliable MI estimation with explicit checks for statistical consistency; (ii) confidence intervals (error bars around estimates) that existing neural MI estimators do not provide; and (iii) a new class of probabilistic critics designed for high-dimensional, high-information settings. We demonstrate the effectiveness of our protocol with computational experiments, showing that it consistently matches or surpasses existing methods while uniquely quantifying its own reliability. We show that reliable MI estimation is sometimes achievable even in severely undersampled, high-dimensional datasets, provided they admit accurate low-dimensional representations. This broadens the scope of applicability of neural MI estimators and clarifies when such estimators can be trusted.
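A minimal sketch of the kind of consistency check and error bar advocated above, with estimate_mi as a placeholder for any classical or neural estimator; the specific subsample fractions and the plateau criterion are illustrative choices, not the paper's protocol.

import numpy as np

def mi_with_checks(x, y, estimate_mi, fractions=(0.5, 0.75, 1.0), n_boot=50, seed=0):
    """Report an MI estimate, a bootstrap 68% interval, and a crude consistency flag:
    if the estimate still drifts with sample size, it should not be trusted."""
    x, y = np.asarray(x), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(x)
    curve = [estimate_mi(x[: int(f * n)], y[: int(f * n)]) for f in fractions]
    boots = [estimate_mi(x[idx], y[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    lo, hi = np.percentile(boots, [16, 84])
    consistent = abs(curve[-1] - curve[0]) < (hi - lo)
    return curve[-1], (lo, hi), consistent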
Brightness is a critical metric for optimizing the design of neutron sources and beamlines, yet there is no direct way to calculate brightness within most Monte Carlo packages used for neutron source simulation. In this paper, we present Brightify, an open-source Python-based tool designed to calculate brightness from Monte Carlo Particle List (MCPL) files, which can be extracted from several Monte Carlo simulation packages. Brightify provides an efficient computational approach to calculate brightness for any particle type and energy spectrum recorded in the MCPL file. It enables localized, directionally-resolved brightness evaluations by scanning across both spatial and angular domains, facilitating the identification of positions and directions corresponding to maximum brightness. This functionality is particularly valuable for identifying brightness hotspots and helping fine-tune the design of neutron sources for optimal performance. We validate Brightify against standard methods, such as surface current tally and point estimator tally, and demonstrate its accuracy and adaptability, particularly in high-resolution analyses. By overcoming the limitations of traditional methods, Brightify streamlines neutron source re-optimization, reduces computational burden, and accelerates source development workflows. The full code is available on the Brightify GitHub repository.
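A minimal sketch of a localized, directionally resolved brightness estimate from an MCPL particle list, assuming the mcpl Python module is available; this is not Brightify itself, and absolute normalization (time or source power, wavelength band) is omitted.

import numpy as np
import mcpl   # MCPL Python module (assumption: installed alongside the MCPL tools)

def local_brightness(filename, x0, y0, half_width_cm, cone_deg):
    """Sum of particle weights in a small spatial window and a forward cone around +z,
    divided by the phase-space acceptance (area * solid angle); units are relative."""
    area = (2.0 * half_width_cm) ** 2
    solid_angle = 2.0 * np.pi * (1.0 - np.cos(np.radians(cone_deg)))
    cos_min = np.cos(np.radians(cone_deg))
    w_sum = 0.0
    for p in mcpl.MCPLFile(filename).particles:
        in_window = abs(p.x - x0) < half_width_cm and abs(p.y - y0) < half_width_cm
        if in_window and p.uz > cos_min:
            w_sum += p.weight
    return w_sum / (area * solid_angle)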
Two maximum likelihood-based algorithms for unfolding or deconvolution are considered: the Richardson-Lucy method and the Data Unfolding method with Mean Integrated Square Error (MISE) optimization [10]. Unfolding is viewed as a procedure for estimating an unknown probability density function. Both external and internal quality assessment methods can be applied for this purpose. In some cases, external criteria exist to evaluate deconvolution quality. A typical example is the deconvolution of a blurred image, where the sharpness of the restored image serves as an indicator of quality. However, defining such external criteria can be challenging, particularly when a measurement has not been performed previously. In such instances, internal criteria are necessary to assess the quality of the result independently of external information. The article discusses two internal criteria: MISE for the unfolded distribution and the condition number of the correlation matrix of the unfolded distribution. These internal quality criteria are applied to a comparative analysis of the two methods using identical numerical data. The results of the analysis demonstrate the superiority of the Data Unfolding method with MISE optimization over the Richardson-Lucy method.
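For reference, a minimal sketch of the Richardson-Lucy iteration for a discretized response matrix; the stopping criterion and the MISE-optimized alternative discussed above are not reproduced here.

import numpy as np

def richardson_lucy(measured, R, n_iter=50):
    """Richardson-Lucy unfolding: R[i, j] = P(measured bin i | true bin j)."""
    measured = np.asarray(measured, dtype=float)
    est = np.full(R.shape[1], measured.sum() / R.shape[1])   # flat starting point
    eff = R.sum(axis=0)                                      # per-true-bin efficiency
    for _ in range(n_iter):
        folded = R @ est
        ratio = np.divide(measured, folded, out=np.zeros_like(folded), where=folded > 0)
        est = est * (R.T @ ratio) / eff                      # multiplicative update
    return est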
We propose a novel centrality-definition-independent method for analyzing higher-order cumulants, specifically addressing the challenge of volume fluctuations that dominate in low-energy heavy-ion collisions. This method reconstructs particle number distributions using the Edgeworth expansion, with parameters optimized via a combination of a differential evolution algorithm and Bayesian inference. Its effectiveness is validated using UrQMD model simulations and benchmarked against traditional approaches, including centrality definitions based on particle multiplicity. Our results show that the proposed framework yields cumulant patterns consistent with those obtained using centrality observables based on the number of participant nucleons ($N_{\text{part}}$), while eliminating the conventional reliance on centrality determination. This consistency confirms the method's ability to extract genuine physical signals, thereby paving the way for probing the intrinsic thermodynamic properties of the produced medium through event-by-event fluctuations.
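A minimal sketch of the fitting idea, assuming SciPy's differential evolution and a leading-order Edgeworth-expanded shape; the Bayesian-inference stage and the cumulant extraction of the full framework are omitted.

import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import norm

def edgeworth_pdf(x, mu, sigma, k3, k4):
    """Leading-order Edgeworth expansion around a Gaussian (skewness and excess-kurtosis
    corrections via Hermite polynomials)."""
    z = (x - mu) / sigma
    he3 = z**3 - 3.0 * z
    he4 = z**4 - 6.0 * z**2 + 3.0
    return norm.pdf(z) / sigma * (1.0 + k3 / 6.0 * he3 + k4 / 24.0 * he4)

def fit_multiplicity(n_values, counts):
    """Fit the Edgeworth parameters to a measured multiplicity distribution."""
    n_values = np.asarray(n_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    p_obs = counts / counts.sum()

    def chi2(theta):
        p_mod = np.clip(edgeworth_pdf(n_values, *theta), 1e-12, None)
        p_mod = p_mod / p_mod.sum()
        return np.sum((p_obs - p_mod) ** 2 / p_mod)

    bounds = [(n_values.min(), n_values.max()),        # mu
              (1.0, np.ptp(n_values) + 1.0),           # sigma
              (-1.0, 1.0), (-1.0, 1.0)]                # k3, k4
    return differential_evolution(chi2, bounds, seed=1).x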