Optimal Structured Matrix Approximation for Robustness to Incomplete Biosequence Data
Chris Salahub, Jeffrey Uhlmann
TL;DR
The paper addresses the challenge of performing spectral analyses on large $n \times n$ bioinformatics matrices when data are incomplete. It introduces a general Frobenius-norm minimization framework for approximating an arbitrary matrix by a structured one. It proves that the optimal structured matrix $\mathbf{T}_M$ is obtained by replacing the entries in each index set with their mean, which reduces storage from $O(n^2)$ to $O(|\mathbf{t}|)$ and leaves a residual tied to the within-group variance; in the circulant case, the implementation aligns with the DFT. The authors illustrate the approach with simulated and real LD data, showing improved robustness to missing data (fewer negative eigenvalues and smaller eigenvalue discrepancy) at the cost of bias when data are nearly complete. They also discuss extensions such as a structured-plus-diagonal form that preserves positive semidefiniteness. Overall, the method offers a simple, efficient procedure for robust matrix approximation with broad applicability in genomics and beyond.
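The averaging rule summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`structured_approx`, `toeplitz_label`, `make_circulant_label`, `frobenius_sq`) and the labelling convention (entries share a label when they must be equal in the structured matrix) are assumptions made for this example, and the dictionary `t` only plays the role we assume for the parameter vector $\mathbf{t}$.

```python
from collections import defaultdict


def structured_approx(M, label):
    """Replace every entry of M by the mean of the entries sharing its label.

    `label(i, j)` is a hypothetical helper that maps an index pair to the
    group (index set) whose entries must be equal in the structured matrix.
    """
    n_rows, n_cols = len(M), len(M[0])
    groups = defaultdict(list)
    for i in range(n_rows):
        for j in range(n_cols):
            groups[label(i, j)].append(M[i][j])
    # One free parameter per group: the group mean.
    t = {k: sum(v) / len(v) for k, v in groups.items()}
    T = [[t[label(i, j)] for j in range(n_cols)] for i in range(n_rows)]
    return T, t


def toeplitz_label(i, j):
    # Toeplitz structure: entries are constant along each diagonal i - j.
    return i - j


def make_circulant_label(n):
    # Circulant structure: diagonals wrap around modulo n.
    return lambda i, j: (i - j) % n


def frobenius_sq(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

T, t = structured_approx(M, toeplitz_label)
residual = frobenius_sq(M, T)  # sum of within-group squared deviations
C, c = structured_approx(M, make_circulant_label(3))
```

For this $3 \times 3$ input the Toeplitz approximation stores 5 diagonal means instead of 9 entries, and the squared residual equals the sum of squared deviations of the entries from their group means. Moving any single group mean away from its value can only increase that residual, which is the sense in which the averaged matrix is optimal in the Frobenius norm.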
Abstract
We propose a general method for optimally approximating an arbitrary matrix $\mathbf{M}$ by a structured matrix $\mathbf{T}$ (circulant, Toeplitz/Hankel, etc.) and examine its use for estimating the spectra of genomic linkage disequilibrium matrices. This application is prototypical of a variety of genomic and proteomic problems that demand robustness to incomplete biosequence information. We perform a simulation study and corroborative test of our method using real genomic data from the Mouse Genome Database. The results confirm the predicted utility of the method and provide strong evidence of its potential value to a wide range of bioinformatics applications. Our optimal general matrix approximation method is expected to be of independent interest to an even broader range of applications in applied mathematics and engineering.
