A Generalized Trace Reconstruction Problem: Recovering a String of Probabilities
Joey Rivkin, Gregory Valiant, Paul Valiant
TL;DR
This work generalizes trace reconstruction by letting the underlying string be a length-$n$ vector of probabilities $S=(p_1,\dots,p_n)$. Each trace is formed by sampling independent Bernoulli bits with those probabilities and then deleting each bit independently with probability $\delta$. The authors prove a strong worst-case lower bound: for $\delta$ as small as order $1/\sqrt{n}$, any algorithm that approximates $S$ to constant $\ell_\infty$ error, or to $\ell_1$ error $o(\sqrt{n})$, requires at least $2^{\Omega(\sqrt{n})}$ traces. The proof uses a Fourier-analytic bound on an alternating-sum expression under a partial deletion model. In contrast, they establish an average-case positive result: when the $p_i$ are i.i.d. uniform on $[0,1]$ and $\delta$ is a sufficiently small constant, a polynomial-time algorithm using $\mathrm{poly}(n,1/\varepsilon)$ traces recovers $S$ to within $\ell_1$ error $\varepsilon$ with high probability. The key techniques are (i) a Fourier and moment-generating-function analysis that bounds the alternating sums arising in the lower bound, and (ii) a chunk-based reconstruction scheme that identifies deletion-free regions and aligns them to produce unbiased estimates of the underlying probabilities. The results show a sharp contrast between worst-case hardness and average-case tractability in a natural generalization of trace reconstruction, with potential implications for modeling sequence degradation and population-level mutation effects.
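For intuition on the Fourier-analytic step, the following standard identity shows why an alternating sum can be read as a Fourier transform evaluated at $\pm\pi$. The transform convention and the generic absolutely summable function $f$ on the integers are assumptions made here for illustration, not the specific function analyzed in the paper:

$$\hat f(\theta) := \sum_{k\in\mathbb{Z}} f(k)\, e^{i\theta k}, \qquad \sum_{k\in\mathbb{Z}} (-1)^k f(k) \;=\; \sum_{k\in\mathbb{Z}} f(k)\, e^{\pm i\pi k} \;=\; \hat f(\pm\pi).$$

Under this reading, any bound showing that $|\hat f(\theta)|$ is small away from $\theta=0$, and in particular at $\theta=\pm\pi$, is a bound on the alternating sum itself.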
Abstract
We introduce the following natural generalization of trace reconstruction, parameterized by a deletion probability $\delta\in (0,1)$ and length $n$: There is a length $n$ string of probabilities, $S=p_1,\ldots,p_n$, and each "trace" is obtained by 1) sampling a length $n$ binary string whose $i$th coordinate is independently set to 1 with probability $p_i$ and 0 otherwise, and then 2) deleting each of the binary values independently with probability $\delta$, and returning the corresponding binary string of length $\le n$. The goal is to recover an estimate of $S$ from a set of independently drawn traces. In the case that all $p_i \in \{0,1\}$, this is the standard trace reconstruction problem. We show two complementary results. First, for worst-case strings $S$ and any deletion probability of order at least $1/\sqrt{n}$, no algorithm can approximate $S$ to constant $\ell_\infty$ distance or $\ell_1$ distance $o(\sqrt{n})$ using fewer than $2^{\Omega(\sqrt{n})}$ traces. Second, as in the case of standard trace reconstruction, reconstruction is easy for random $S$: for any sufficiently small constant deletion probability and any $\varepsilon>0$, if each $p_i$ is drawn independently from the uniform distribution over $[0,1]$, then with high probability $S$ can be recovered to $\ell_1$ error $\varepsilon$ using $\mathrm{poly}(n,1/\varepsilon)$ traces and computation time. We show indistinguishability in our lower bound by regarding a complicated alternating sum (comparing two distributions) as the Fourier transform of some function evaluated at $\pm\pi$, and then showing that the Fourier transform decays rapidly away from zero by analyzing its moment generating function.
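The two-step trace model described above can be sketched in a few lines of Python. This is only an illustrative simulation of the sampling process, not the reconstruction algorithm; the function names `sample_trace` and `sample_traces` and the `seed` parameter are hypothetical choices made for this sketch.

```python
import random


def sample_trace(probs, delta, rng):
    """Draw one trace from the probability string `probs`.

    Step 1: set the i-th bit to 1 with probability probs[i], else 0.
    Step 2: delete each bit independently with probability delta.
    """
    trace = []
    for p in probs:
        bit = 1 if rng.random() < p else 0  # Bernoulli(p) sample
        if rng.random() >= delta:           # bit survives with probability 1 - delta
            trace.append(bit)
    return trace


def sample_traces(probs, delta, num_traces, seed=0):
    """Draw `num_traces` independent traces (the seed is an assumption for reproducibility)."""
    rng = random.Random(seed)
    return [sample_trace(probs, delta, rng) for _ in range(num_traces)]
```

When every entry of `probs` is 0 or 1, step 1 is deterministic and the sketch reduces to the standard trace reconstruction setting, where each trace is the fixed binary string with random deletions.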
