A Generalized Trace Reconstruction Problem: Recovering a String of Probabilities

Joey Rivkin, Gregory Valiant, Paul Valiant

TL;DR

This work generalizes trace reconstruction by letting the underlying string be a length-$n$ vector of probabilities $S=(p_1,\dots,p_n)$: each trace is formed by sampling independent Bernoulli bits with those probabilities and then deleting each bit independently with probability $\delta$. The authors prove a strong worst-case lower bound: for any $\delta$ of order at least $1/\sqrt{n}$, any algorithm that approximates $S$ to constant $\ell_\infty$ error, or to $\ell_1$ error $o(\sqrt{n})$, requires at least $2^{\Omega(\sqrt{n})}$ traces. The proof is a Fourier-analytic bound on an alternating-sum expression under a partial deletion model. In contrast, they establish an average-case positive result: when the $p_i$ are i.i.d. uniform on $[0,1]$ and $\delta$ is a sufficiently small constant, a polynomial-time algorithm using $\mathrm{poly}(n,1/\varepsilon)$ traces recovers $S$ to within $\ell_1$ error $\varepsilon$ with high probability. The key techniques are (i) a Fourier/moment-generating-function analysis that bounds complicated alternating sums for the lower bound, and (ii) a chunk-based reconstruction scheme that identifies deletion-free regions and aligns them to produce unbiased estimates of the underlying probabilities. The results show a sharp contrast between worst-case hardness and average-case tractability in a natural generalization of trace reconstruction, with potential implications for modeling sequence degradation and population-level mutation effects.

Abstract

We introduce the following natural generalization of trace reconstruction, parameterized by a deletion probability $\delta\in (0,1)$ and length $n$: There is a length $n$ string of probabilities, $S=p_1,\ldots,p_n,$ and each "trace" is obtained by 1) sampling a length $n$ binary string whose $i$th coordinate is independently set to 1 with probability $p_i$ and 0 otherwise, and then 2) deleting each of the binary values independently with probability $\delta$, and returning the corresponding binary string of length $\le n$. The goal is to recover an estimate of $S$ from a set of independently drawn traces. In the case that all $p_i \in \{0,1\}$ this is the standard trace reconstruction problem. We show two complementary results. First, for worst-case strings $S$ and any deletion probability at least order $1/\sqrt{n}$, no algorithm can approximate $S$ to constant $\ell_\infty$ distance or $\ell_1$ distance $o(\sqrt n)$ using fewer than $2^{\Omega(\sqrt{n})}$ traces. Second -- as in the case for standard trace reconstruction -- reconstruction is easy for random $S$: for any sufficiently small constant deletion probability, and any $\varepsilon>0$, drawing each $p_i$ independently from the uniform distribution over $[0,1]$, with high probability $S$ can be recovered to $\ell_1$ error $\varepsilon$ using $\mathrm{poly}(n,1/\varepsilon)$ traces and computation time. We show indistinguishability in our lower bound by regarding a complicated alternating sum (comparing two distributions) as the Fourier transformation of some function evaluated at $\pm\pi$, and then showing that the Fourier transform decays rapidly away from zero by analyzing its moment generating function.
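
The two-step trace model described above can be simulated directly. The following is a minimal sketch, not code from the paper: the function names `sample_trace` and `sample_traces`, the `seed` parameter, and the use of Python's standard `random` module are all assumptions made for illustration.

```python
import random


def sample_trace(probs, delta, rng):
    """Draw one trace from a string of probabilities (illustrative helper).

    probs: the string S = p_1, ..., p_n, each p_i in [0, 1].
    delta: deletion probability applied independently to every position.
    rng:   a random.Random instance (hypothetical parameter, for reproducibility).
    """
    trace = []
    for p in probs:
        # Step 1: set the i-th bit to 1 with probability p_i, else 0.
        bit = 1 if rng.random() < p else 0
        # Step 2: delete the bit with probability delta, keep it otherwise.
        if rng.random() >= delta:
            trace.append(bit)
    return trace


def sample_traces(probs, delta, num_traces, seed=0):
    """Draw num_traces independent traces of the same string."""
    rng = random.Random(seed)
    return [sample_trace(probs, delta, rng) for _ in range(num_traces)]


if __name__ == "__main__":
    S = [0.0, 1.0, 0.5, 0.25]  # example string of probabilities
    for t in sample_traces(S, delta=0.1, num_traces=5):
        print(t)
```

When every $p_i \in \{0,1\}$, step 1 is deterministic and the sketch reduces to the standard deletion channel, matching the special case noted in the abstract. The difficulty of reconstruction comes from step 2: once bits are deleted, position $j$ of a trace no longer corresponds to a known position $i$ of $S$.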

Paper Structure

This paper contains 9 sections, 14 theorems, 27 equations, 2 algorithms.

Key Result

Theorem 2

There exists a pair of length $n$ sequences $S=p_1,\ldots,p_n$ and $S'=p'_1,\ldots,p'_n$ with constant $\ell_\infty$ distance and $\ell_1$ distance $\Theta(\sqrt{n})$, and an absolute constant $c$ such that for any deletion probability $\delta \geq \frac{c}{\sqrt{n}}$---and in particular, for all con

Theorems & Definitions (28)

  • Definition 1: Generalized Trace Reconstruction
  • Theorem 2
  • Theorem 3
  • Definition 4
  • Theorem 5
  • Lemma 6
  • proof
  • Lemma 7
  • proof
  • Lemma 8
  • ...and 18 more