Doubly-Unlinked Regression for Dependent Data

Anik Burman; Sayantan Choudhury; Debangan Dey

Abstract

Shuffled regression concerns settings in which covariates and responses are observed without their correct pairing. In dependent-data problems, a second form of missing correspondence can arise when responses are also detached from the latent temporal, spatial, or geometric domain that induces their dependence structure. We study regression under this joint loss of correspondence and, to our knowledge, provide the first systematic treatment of this setting. Specifically, we consider a doubly-unlinked regression model in which both the covariate-response link and the response-domain link are unknown, represented by two latent permutation matrices, while dependence is induced by an unobserved stochastic process. This framework unifies shuffled regression and latent-domain permutation models within a common dependent-data setting. We characterize signal-to-noise regimes governing recovery of the regression parameter and the latent permutations, and show that consistent estimation of the regression coefficient can be achieved under strictly weaker conditions than exact permutation recovery. To address the combinatorial difficulty of inference, we develop REPAIR, a variational Bayes method based on a block-structured permutation model that captures localized scrambling while substantially reducing computational complexity. Simulations and an applied example illustrate the empirical behavior of REPAIR and support the theoretical results.
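The block-structured permutation idea mentioned in the abstract can be illustrated with a small sketch. This is not the REPAIR implementation. It assumes that "localized scrambling" means indices are shuffled only within each of $B$ contiguous blocks of size $K$ (so $n = KB$), and the function names `block_permutation`, `log10_full_count`, and `log10_block_count` are hypothetical. Under that assumption the number of candidate permutations drops from $(KB)!$ to $(K!)^B$.

```python
import math
import random


def block_permutation(num_blocks, block_size, rng):
    """Return a permutation of range(num_blocks * block_size), as a list,
    that shuffles indices only within each contiguous block.
    (Assumed reading of "localized scrambling"; not the paper's code.)"""
    perm = []
    for b in range(num_blocks):
        block = list(range(b * block_size, (b + 1) * block_size))
        rng.shuffle(block)  # scramble inside the block only
        perm.extend(block)
    return perm


def log10_full_count(n):
    """log10 of n!, the number of unrestricted permutations of n items."""
    return math.lgamma(n + 1) / math.log(10)


def log10_block_count(num_blocks, block_size):
    """log10 of (K!)^B, the number of block-restricted permutations."""
    return num_blocks * math.lgamma(block_size + 1) / math.log(10)


rng = random.Random(0)
B, K = 5, 4  # illustrative sizes, not taken from the paper

# Two independent latent permutations, standing in for the
# covariate-response link and the response-domain link.
pi_x = block_permutation(B, K, rng)
pi_s = block_permutation(B, K, rng)

print("pi_x:", pi_x)
print("pi_s:", pi_s)
print("log10 (KB)!  =", round(log10_full_count(B * K), 2))
print("log10 (K!)^B =", round(log10_block_count(B, K), 2))
```

Counts are reported on a log scale because $(KB)!$ overflows floating point for even moderate $n$.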

Paper Structure (20 sections, 21 theorems, 171 equations, 6 figures, 2 algorithms)

Introduction
Background and Problem Setup
Theoretical Guarantees for MLE
Variational Inference for Joint Permutation and Parameter Estimation
Simulation Studies
Real Data Analysis
Discussion
Software
Acknowledgments
Details of Variational Inference
Variational Density of Parameters
Full ELBO
Proofs
Propositions used in Theorems
Proof of Theorem \ref{thm:error-pi1-pi2}
...and 5 more sections

Key Result

Theorem 1

For $\mathrm{SNR} = \Omega(K^\alpha)$ with $\alpha > 1$ and $B \geq B_\ast(K,\alpha) \coloneq \frac{\alpha\log K}{K^\alpha - K}$, the maximum likelihood estimator $\left( \widehat{\boldsymbol{\Pi}}_{1, \texttt{ML}}, \widehat{\boldsymbol{\Pi}}_{2, \texttt{ML}} \right)$ [...] where the total sample size is $n = KB$ and $c^\ast_1, c^\ast_2 > 0$ are universal constants.
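To get a feel for the block-count condition in Theorem 1, the threshold $B_\ast(K,\alpha) = \frac{\alpha \log K}{K^\alpha - K}$ can be evaluated numerically. The sketch below assumes $\log$ is the natural logarithm, which the excerpt does not state, and the function name `block_threshold` is hypothetical.

```python
import math


def block_threshold(K, alpha):
    """Evaluate B_*(K, alpha) = alpha * log(K) / (K**alpha - K).

    Assumes the natural logarithm. Requires K >= 2 and alpha > 1 so that
    the denominator K**alpha - K is strictly positive.
    """
    if K < 2:
        raise ValueError("K must be at least 2")
    if alpha <= 1:
        raise ValueError("alpha must exceed 1")
    return alpha * math.log(K) / (K ** alpha - K)


# Illustrative values only; these (K, alpha) pairs are not from the paper.
for K in (5, 10, 20):
    for alpha in (1.5, 2.0):
        print(f"K={K:>2}, alpha={alpha}: B_* = {block_threshold(K, alpha):.4f}")
```

For a fixed $\alpha > 1$ the denominator grows polynomially in $K$ while the numerator grows only logarithmically, so the threshold shrinks as the block size increases.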

Figures (6)

Figure 1: Schematic illustration of the doubly-unlinked regression setting. The left panel shows the classical linked-data setup with aligned exposures, outcomes, and domain variables. The middle panel illustrates the doubly-unlinked case where exposures and domain variables are observed after unknown permutations. The goal is to recover the regression effect $\beta$ and the permutation matrices $\boldsymbol{\pi}_X$ and $\boldsymbol{\pi}_S$.
Figure 2: Scaled RMSE of $\hat{\beta}$ across simulation settings. Columns correspond to the number of regions $B$ and the horizontal axis shows the number of observations per region $K$. Results are reported for two SNR regimes, $\beta=2$ and $\beta=8$, comparing FullGP, ArealGP, and REPAIR.
Figure 3: Permutation recovery performance of the REPAIR method across simulation settings. Cells show the estimated recovery probability for the permutation matrices $\pi_S$ and $\pi_X$ across different values of the number of blocks $B$ and observations per block $K$, under two signal regimes ($\beta = 2$ and $\beta = 8$).
Figure 4: Partitions for the Meuse analysis after dropping five observations: (a) $15\times 10$ and (b) $30\times 5$. Colored points denote block memberships, and red crosses indicate the removed observations.
Figure 5: Comparison of latent surface $W$ estimates from REPAIR and the FullGP benchmark under the two blocking schemes. Panels (1a) and (2a) plot $\hat{\boldsymbol{\mu}}_W^{\mathrm{REPAIR}}$ against $\hat{\boldsymbol{\mu}}_W^{\mathrm{FullGP}}$ for the $30\times 5$ and $15\times 10$ partitions, respectively. Panels (1b) and (2b) plot $\widehat{\boldsymbol{\Pi}}_S \hat{\boldsymbol{\mu}}_W^{\mathrm{REPAIR}}$ against $\hat{\boldsymbol{\mu}}_W^{\mathrm{FullGP}}$. The dashed blue line is the $45^\circ$ reference line, and the solid red line is the least-squares fit.
...and 1 more figure

Theorems & Definitions (32)

Theorem 1
Theorem 2
Theorem 3 (\cite{ren2011variational})
Proposition 1: Conditional Concentration of $\mathbf{E}_1$ (\ref{eq:error-partition})
proof
Proposition 2
proof
Proposition 3
proof
Proposition 4
...and 22 more
