Hypothesis tests and model parameter estimation on data sets with missing correlation information

Lukas Koch

TL;DR

This paper addresses statistical analyses when full inter-point covariance information is unavailable, proposing robust simple-hypothesis tests (fitted, p_min, and f_max variants) and a derating strategy to inflate parameter uncertainties under unknown correlations. The fitted statistic minimizes the Mahalanobis distance over feasible off-diagonal blocks and leads to a conservative Cee-squared distribution for p-values; an algorithmic whitening-based approach yields worst-case derating factors to preserve coverage up to a chosen level (e.g., $\gamma=0.997$). The methods are demonstrated on neutrino interaction data (neutrino tune comparisons, cross-section tests) and extended to Goodness of Fit and composite hypotheses. The work emphasizes practical guidance for combining results with partial correlation information and provides software implementations in NuStatTools.

Abstract

Ideally, all analyses of normally distributed data should include the full covariance information between all data points. In practice, the full covariance matrix is not always available, either because a result was published without one, or because one tries to combine multiple results from separate publications. For simple hypothesis tests, it is possible to define robust test statistics that behave conservatively in the presence of unknown correlations. For model parameter fits, one can inflate the variance by a factor that ensures the resulting confidence regions remain conservative at least up to a chosen confidence level. This paper describes a class of robust test statistics for simple hypothesis tests, as well as an algorithm to determine the necessary inflation factor for model parameter fits, Goodness of Fit tests, and composite hypothesis tests. It then presents some example applications of the methods to real neutrino interaction data and model comparisons.
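The problem the abstract describes can be illustrated with a small toy study. The sketch below is not taken from the paper or from NuStatTools; it assumes equicorrelated unit-variance data and uses a hypothetical helper, `naive_rejection_rate`, to compare the assumed significance level of the naive squared Mahalanobis distance (computed as if the data were uncorrelated) with the fraction of toys that actually exceed the corresponding threshold.

```python
import numpy as np
from scipy import stats


def naive_rejection_rate(rho, n_dim=10, alpha=0.01, n_toys=200_000, seed=0):
    """Fraction of toys rejected by the naive test at assumed level alpha.

    Hypothetical helper for illustration. The data have unit variance and
    pairwise correlation rho, but the test treats them as uncorrelated.
    """
    rng = np.random.default_rng(seed)
    # Equicorrelated data: x_i = sqrt(rho) * c + sqrt(1 - rho) * e_i
    common = rng.standard_normal((n_toys, 1))
    indep = rng.standard_normal((n_toys, n_dim))
    x = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * indep
    # Naive squared Mahalanobis distance, i.e. with an identity covariance
    d2 = np.sum(x**2, axis=1)
    # Threshold from the chi-squared distribution assumed for uncorrelated data
    threshold = stats.chi2.ppf(1.0 - alpha, df=n_dim)
    return float(np.mean(d2 > threshold))


rate_uncorr = naive_rejection_rate(rho=0.0)
rate_corr = naive_rejection_rate(rho=0.5)
print(f"assumed level 0.01, actual (rho=0.0): {rate_uncorr:.4f}")
print(f"assumed level 0.01, actual (rho=0.5): {rate_corr:.4f}")
```

With the ignored correlation switched on, the true hypothesis is rejected noticeably more often than the assumed level suggests. This is the undercoverage that the robust statistics and the variance inflation are meant to prevent.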

Paper Structure

This paper contains 13 sections, 47 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: CDFs (left) for the "naive" squared M-distance test statistic for different levels of correlations in the data. When using the uncorrelated CDF to calculate the assumed significance level (or p-value) of a value of the statistic, the actual level will differ from the assumption depending on the correlations (right). Where the real significance level is larger than the assumed significance level (the real significance is weaker than the assumed one), the test statistic shows undercoverage. This corresponds to the region in the CDF where the real CDF for a given value is lower than the expected one.
  • Figure 2: CDFs (left) for the "fitted" test statistic for different levels of correlations in the data. When using the uncorrelated CDF to calculate the assumed significance level (or p-value) of a value of the statistic, the actual level will differ from the assumption depending on the correlations (right). In the presence of correlations, the real significance is consistently higher (the significance level is lower) than the assumption. This means the uncertainties are overestimated and the statistic behaves conservatively.
  • Figure 3: Illustration of the robustness of $\mathop{\mathrm{f_{\max}}}\nolimits$ test statistics for 2 blocks. If there are no correlations between the blocks, the $y$ variables will be independently uniformly distributed, and the expected CDF for the $\mathop{\mathrm{f_{\max}}}\nolimits$ statistic as a function of $z$ is equal to the area $A = \prod_i (1 - p_i)$. In the presence of unaccounted correlations, the "worst case" is if the $y$ variables are $100\%$ correlated, and all probability is concentrated along the diagonal line. In this case, the actual CDF is equal to $x = (1-p_{\max}) > A$. This means the assumed p-value $1 - A$ is bigger than the real p-value $p_{\max}$, and the $\mathop{\mathrm{f_{\max}}}\nolimits$ test statistic is conservative.
  • Figure 4: Derivative of the functions used for the $\mathop{\mathrm{optimal-f_{\max}}}\nolimits$ test statistic for different numbers of degrees of freedom. Only the parameter range up to the mode of the $\chi^2$ distribution is shown. For large numbers of degrees of freedom, limited numerical precision means that the CDF evaluates to exactly 0 for small $x$, which limits the range where the derivative can be calculated. Wherever it can be evaluated, the derivative is strictly positive, so the function is suitable for an $\mathop{\mathrm{f_{\max}}}\nolimits$ statistic.
  • Figure 5: CDFs (left) for the "naive" squared M-distance in the projected parameter space for different levels of correlations in the data. This is equivalent to a parameter estimation by running a fit. When using the uncorrelated CDF to calculate the assumed significance level (or p-value) of a value of the statistic, the actual level will differ from the assumption depending on the correlations (right). Where the real significance level is larger than the assumed significance level (the real significance is weaker than the assumed one), the test statistic shows undercoverage. This corresponds to the region in the CDF where the real CDF for a given value is lower than the expected one.
  • ...and 5 more figures
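
The argument in the Figure 3 caption can be checked numerically. The following sketch rests on stated assumptions rather than on the paper's implementation: it takes the simplest statistic of this kind to be the smallest block p-value, calibrated as $1 - (1 - p_{\min})^k$ for $k$ independent blocks, and it models the unknown cross-block correlation with a single parameter `rho`. The helper name `pmin_rejection_rate` is hypothetical.

```python
import numpy as np
from scipy import stats


def pmin_rejection_rate(rho, n_blocks=2, block_dim=3, alpha=0.05,
                        n_toys=200_000, seed=1):
    """Fraction of toys rejected at assumed level alpha (illustrative helper).

    Each block has a known identity covariance. Element j of every block
    shares a common component, so corresponding elements of different
    blocks have correlation rho, which the test does not know about.
    """
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((n_toys, 1, block_dim))
    indep = rng.standard_normal((n_toys, n_blocks, block_dim))
    x = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * indep
    # Per-block chi-squared values and their individual p-values
    chi2_blocks = np.sum(x**2, axis=2)
    p_blocks = stats.chi2.sf(chi2_blocks, df=block_dim)
    # Assumed p-value of the smallest block p-value for independent blocks
    p_min = p_blocks.min(axis=1)
    p_assumed = 1.0 - (1.0 - p_min) ** n_blocks
    return float(np.mean(p_assumed < alpha))


rate_indep = pmin_rejection_rate(rho=0.0)
rate_corr = pmin_rejection_rate(rho=0.9)
rate_full = pmin_rejection_rate(rho=1.0)
print(f"rho=0.0: {rate_indep:.4f}")
print(f"rho=0.9: {rate_corr:.4f}")
print(f"rho=1.0: {rate_full:.4f}")
```

For independent blocks the rejection rate matches the assumed level. As the unaccounted correlation grows, the rate drops towards $1 - \sqrt{1 - \alpha}$ for two blocks, so the assumed p-value overstates the real one and the test errs on the conservative side, as the caption describes.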