Better Together: Cross and Joint Covariances Enhance Signal Detectability in Undersampled Data

Arabind Swain, Sean Alexander Ridout, Ilya Nemenman

Abstract

Many data-science applications involve detecting a shared signal between two high-dimensional variables. Using random matrix theory methods, we determine when such a signal can be detected and reconstructed from sample correlations, despite the background of correlations induced by sampling noise. We consider three different covariance matrices constructed from two high-dimensional variables: their individual self covariance, their cross covariance, and the self covariance of the concatenated (joint) variable, which incorporates the self and the cross correlation blocks. We observe the expected Baik, Ben Arous, and Péché (BBP) detectability phase transition in all these covariance matrices, and we show that joint and cross covariance matrices always reconstruct the shared signal earlier than the self covariances. Whether the joint or the cross approach is better depends on the mismatch of dimensionalities between the variables. We discuss what these observations mean for choosing the right method for detecting linear correlations in data and how these findings may generalize to nonlinear statistical dependencies.
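To make the three constructions concrete, here is a minimal NumPy sketch. It is not taken from the paper: the generative model (a single shared latent variable `u` projected onto unit directions `v_x` and `v_y` with strengths `a` and `b`, plus unit-variance Gaussian noise), the `1/T` normalization, and the function names are all illustrative assumptions.

```python
import numpy as np

def simulate_spiked_pair(T, N_X, N_Y, a, b, rng):
    """Draw T samples of X (N_X dims) and Y (N_Y dims) sharing one latent signal.

    ASSUMED model (not from the paper): X = a * u v_x^T + noise and
    Y = b * u v_y^T + noise, with i.i.d. standard normal u and noise.
    """
    u = rng.standard_normal(T)                      # shared latent, one value per sample
    v_x = rng.standard_normal(N_X)
    v_x /= np.linalg.norm(v_x)                      # unit-norm signal direction in X
    v_y = rng.standard_normal(N_Y)
    v_y /= np.linalg.norm(v_y)                      # unit-norm signal direction in Y
    X = a * np.outer(u, v_x) + rng.standard_normal((T, N_X))
    Y = b * np.outer(u, v_y) + rng.standard_normal((T, N_Y))
    return X, Y, v_x, v_y

def three_covariances(X, Y):
    """Return self (X and Y), cross, and joint sample covariances.

    The data are zero-mean by construction here, so no centering is applied.
    """
    T = X.shape[0]
    C_xx = X.T @ X / T                              # self covariance of X
    C_yy = Y.T @ Y / T                              # self covariance of Y
    C_xy = X.T @ Y / T                              # cross covariance (N_X x N_Y)
    Z = np.hstack([X, Y])                           # concatenated (joint) variable
    C_zz = Z.T @ Z / T                              # contains C_xx, C_xy, C_yy as blocks
    return C_xx, C_yy, C_xy, C_zz

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Y, v_x, v_y = simulate_spiked_pair(T=200, N_X=200, N_Y=800, a=1.0, b=0.5, rng=rng)
    C_xx, C_yy, C_xy, C_zz = three_covariances(X, Y)
    print(C_xx.shape, C_yy.shape, C_xy.shape, C_zz.shape)
```

The joint matrix is simply the block matrix built from the two self covariances and the cross covariance, which is why it carries both kinds of information at once.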

Paper Structure

This paper contains 14 sections, 56 equations, 10 figures.

Figures (10)

  • Figure 1: Estimation of $X$ and $Y$ signals using the joint covariance. We fix $b=0.5, q_X=1, q_Y=4$ ($T=200$, $N_X=200$, $N_Y = 800$), such that $b < b_{\mathrm{crit}}$, and then vary the $X$ signal strength $a$. As $a$ increases, in numerical simulations, both the $X$ (green squares) and $Y$ (green circles) components of the estimated spike $\hat{v}_{z, \mathrm{joint}}$ develop nonzero overlap with the true spike when $a^2+b^2$ crosses the threshold $c_{\mathrm{crit}}$ (Eq. \ref{eq:joint_outlier}). Lines show analytical predictions, Eqs. (\ref{eq:joint_overlap}, \ref{eq:joint_x}), which agree with numerical simulations, save for finite-size fluctuations. In contrast, $\hat{v}_{y,{\rm self}}$ always has zero overlap with the signal in $Y$, cf. Eq. (\ref{eq:y_overlap}) (blue circles). Averaging is over $n=10$ independent simulations. Error bars are standard deviations.
  • Figure 2: Phase diagram for spike detectability from self and joint covariances. Solid green marks the region where a spike results in a detectable outlier in the joint-covariance matrix. In the region with alternating blue and green hatching, outliers are detectable by both methods. In the white region, neither method can detect a signal. For this plot $q_X=1$, $q_Y=4$. The dotted lines give the bounds where a spike can be detected in the respective self-covariance. The dashed line represents the parameters used in Fig. \ref{fig:concatoverlap}.
  • Figure 3: Estimation of $X$ and $Y$ signals using the cross covariance. We fix $b=2.5, q_X=1, q_Y=20$ ($T = 100$, $N_X = 100$, $N_Y = 2\times10^3$), such that $b < b_{\mathrm{crit}}$, and then vary the $X$ signal strength $a$. As $a$ is increased, in numerical simulations, both $\hat{v}_{x, \mathrm{cross}}$ (orange squares) and $\hat{v}_{y, \mathrm{cross}}$ (orange circles) develop nonzero overlap with the true spike when $a b$ crosses the threshold, determined semi-analytically. Lines show semi-analytical predictions for the overlaps, which agree with numerical simulations, save for finite-size fluctuations. In contrast, $\hat{v}_{y,{\rm self}}$ always has zero overlap with the signal in $Y$, cf. Eq. (\ref{eq:y_overlap}) (blue circles). Averaging is over $n=10$ independent simulations. Error bars are standard deviations.
  • Figure 4: Phase diagram for spike detectability for cross and self covariances. We fix $q_X=1$, $q_Y=20$ (notice that the value of $q_Y$ is different from Fig. \ref{Phaseconcat}, so that advantages of the cross-covariance approach are easier to see). We study how the signal strengths $a$ (for $X$) and $b$ (for $Y$) affect spike detection. In the red region, computed semi-analytically, both the $X$ and $Y$ components of the spike can be partially reconstructed (nonzero overlap). The blue region is where the self covariances of $\mathbf{X}$ and $\mathbf{Y}$ can both detect their spikes, thus providing information about the entire spike. Alternating blue and red stripes therefore mark the region where both approaches give nonzero overlaps with the spike (though the magnitudes of the overlaps may be different). Crucially, the cross covariance may detect the spike when the self covariances cannot, but not the other way around. In the solid white region, neither method can detect the spike.
  • Figure 5: Comparison between joint and cross overlaps for estimating the spike in $Y$. We fix $b=2.5, q_X=1, q_Y=20$ ($T=100$, $N_X=100$, $N_Y=2\times10^3$) such that $b < b_{\mathrm{crit}}$, and $q_Y \gg q_X$, and then vary the $X$ signal strength $a$. As $a$ is increased, in numerical simulations, both $\hat{v}_{y, \mathrm{cross}}$ (orange circles) and $\hat{v}_{y, \mathrm{joint}}$ (green circles) develop nonzero overlap with the true spike $\hat{v}_y$. Colored dashed lines show analytical (joint) and semi-analytical (cross) predictions. In this regime, where $Y$ is much more poorly sampled than $X$, there is a region where the cross $Y$ overlap is large, yet the joint $Y$ overlap is zero. Dotted and dash-dotted black lines represent the analytically (or semi-analytically) calculated BBP transition values for the joint $Y$ overlap and cross $Y$ overlap, respectively. Averaging is over $n=10$ independent simulations. Error bars are standard deviations.
  • ...and 5 more figures
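
The numerical protocol described in the captions above (fix $b$, sweep $a$, estimate the spike from each covariance, average overlaps over $n=10$ runs with standard-deviation error bars) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the generative model, the function names, and the definition of overlap as the squared cosine between the normalized estimated block and the true direction are all assumptions and may differ from the paper's conventions.

```python
import numpy as np

def sample_pair(T, N_X, N_Y, a, b, rng):
    """ASSUMED model: one shared latent u, strengths a and b, unit Gaussian noise."""
    u = rng.standard_normal(T)
    v_x = rng.standard_normal(N_X)
    v_x /= np.linalg.norm(v_x)
    v_y = rng.standard_normal(N_Y)
    v_y /= np.linalg.norm(v_y)
    X = a * np.outer(u, v_x) + rng.standard_normal((T, N_X))
    Y = b * np.outer(u, v_y) + rng.standard_normal((T, N_Y))
    return X, Y, v_x, v_y

def unit(w):
    """Normalize a vector to unit length."""
    return w / np.linalg.norm(w)

def overlaps(X, Y, v_x, v_y):
    """Squared overlaps (an assumed definition) of estimated and true directions."""
    T, N_X = X.shape
    # Joint: the top right singular vector of Z = [X, Y] equals the top
    # eigenvector of the joint covariance Z^T Z / T.
    _, _, Vt = np.linalg.svd(np.hstack([X, Y]), full_matrices=False)
    v_joint = Vt[0]
    # Cross: top left/right singular vectors of X^T Y / T.
    U, _, Wt = np.linalg.svd(X.T @ Y / T, full_matrices=False)
    # Self (Y only): top right singular vector of Y.
    _, _, Vy = np.linalg.svd(Y, full_matrices=False)
    return {
        "joint_x": (unit(v_joint[:N_X]) @ v_x) ** 2,
        "joint_y": (unit(v_joint[N_X:]) @ v_y) ** 2,
        "cross_x": (U[:, 0] @ v_x) ** 2,
        "cross_y": (Wt[0] @ v_y) ** 2,
        "self_y": (Vy[0] @ v_y) ** 2,
    }

def sweep(a_values, b, T, N_X, N_Y, n_rep=10, seed=0):
    """For each a, return {estimator: (mean, std)} over n_rep independent runs."""
    rng = np.random.default_rng(seed)
    results = {}
    for a in a_values:
        runs = [overlaps(*sample_pair(T, N_X, N_Y, a, b, rng)) for _ in range(n_rep)]
        results[a] = {
            key: (np.mean([r[key] for r in runs]), np.std([r[key] for r in runs]))
            for key in runs[0]
        }
    return results

if __name__ == "__main__":
    # Sizes and b taken from the Figure 1 caption; the a grid is arbitrary.
    res = sweep([0.0, 1.0, 2.0, 3.0], b=0.5, T=200, N_X=200, N_Y=800)
    for a, stats in res.items():
        print(a, {k: f"{m:.2f}±{s:.2f}" for k, (m, s) in stats.items()})
```

Working with singular vectors of the data matrices avoids forming and diagonalizing the full $(N_X+N_Y)\times(N_X+N_Y)$ joint covariance, which keeps the sweep cheap when $N_Y \gg T$.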