General Coverage Models: Structure, Monotonicity, and Shotgun Sequencing

Yitzchak Grunbaum, Eitan Yaakobi

TL;DR

This work develops a unified combinatorial framework for coverage problems where each draw reveals a subset of a finite set, recasting the expected coverage time as a counting problem over recovery sets in a hypergraph. It applies the method to structured window-based models (non-cyclic and cyclic windows) and their continuous-arc analogues, deriving exact expressions and asymptotics, notably $E[T_{n,\ell}^{\circlearrowright}]\sim \frac{n}{\ell}\log \frac{n}{\ell}$ for $\ell=o(n)$. The paper also studies uniform $\ell$-regular models, proving monotonicity properties and establishing universal upper bounds, while showing the batch (uniform $\ell$-subset) model often yields the largest coverage time among these models. Collectively, the results connect discrete window-based coverage to the continuous-circle analogue and to the classic coupon-collector framework, with implications for DNA shotgun sequencing and related data-coverage problems. The proposed framework yields exact, computable expressions and sharp asymptotics, while highlighting open questions about subleading terms and model comparisons across structured sampling schemes.

Abstract

We study coverage processes in which each draw reveals a subset of $[n]$, and the goal is to determine the expected number of draws until all items are seen at least once. A classical example is the Coupon Collector's Problem, where each draw reveals exactly one item. Motivated by shotgun DNA sequencing, we introduce a model where each draw is a contiguous window of fixed length, in both cyclic and non-cyclic variants. We develop a unifying combinatorial tool that turns the task of finding the coverage time from a probabilistic computation into a counting problem over families of subsets of $[n]$ that together contain all items, enabling exact calculation. Using this result, we obtain exact expressions for the window models. We then leverage past results on a continuous analogue of the cyclic window model to analyze the asymptotic behavior of both models. We further study what we call uniform $\ell$-regular models, where every draw has size $\ell$ and every item appears in the same number of admissible draws. We compare these to the batch sampling model, in which all $\ell$-subsets are drawn uniformly at random, and present upper and lower bounds, which were also obtained independently by Berend and Sher. We conjecture, and prove for special cases, that this model maximizes the coverage time among all uniform $\ell$-regular models. Finally, we prove a universal upper bound for the entire class of uniform $\ell$-regular models, which shows that many sampling models share the same leading asymptotic order while potentially differing significantly in lower-order terms.
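To make "expected coverage time" concrete for a finite family of admissible draws, the following Python sketch computes it exactly on tiny instances. It relies on the textbook inclusion-exclusion identity $E[T]=\sum_{\emptyset\ne S\subseteq[n]}(-1)^{|S|+1}/(1-q_S)$, where $q_S$ is the probability that a single draw misses every item of $S$. This is a standard baseline and not the paper's counting theorem. The function name, the 0-based item labels, and the assumption that draws are uniform over the listed family are illustrative choices rather than anything taken from the paper.

```python
from fractions import Fraction
from itertools import combinations


def exact_expected_coverage_time(draws, n):
    """Exact E[T] for uniform draws from `draws` over items 0..n-1.

    Uses inclusion-exclusion over all nonempty subsets S of the items:
    E[T] = sum_S (-1)^(|S|+1) / (1 - q_S), where q_S is the fraction of
    admissible draws that contain no item of S.
    """
    draws = [frozenset(d) for d in draws]
    total = Fraction(0)
    for k in range(1, n + 1):
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(range(n), k):
            s = frozenset(subset)
            # Count the admissible draws that miss every item of S.
            missing = sum(1 for d in draws if not (d & s))
            q = Fraction(missing, len(draws))
            if q == 1:
                raise ValueError("some item is never covered by any draw")
            total += sign / (1 - q)
    return total


if __name__ == "__main__":
    # Coupon collector with n = 3 (each draw reveals exactly one item).
    print(exact_expected_coverage_time([{0}, {1}, {2}], 3))
    # Cyclic windows of length 2 over 4 items.
    print(exact_expected_coverage_time([{0, 1}, {1, 2}, {2, 3}, {3, 0}], 4))
```

With singleton draws the function recovers the coupon-collector value $nH_n$. The loop visits all $2^n-1$ nonempty subsets, so it is practical only for small $n$.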

Paper Structure

This paper contains 10 sections, 15 theorems, 90 equations, 5 figures.

Key Result

Theorem 1

Given a hypergraph $\mathcal{H}$,

Figures (5)

  • Figure 1: Example run of the non-cyclic windows problem with $n=8$ and $\ell=3$; coverage at $T_{n,\ell}=5$.
  • Figure 2: Example run of the cyclic windows problem with $n=10$ and $\ell=4$; coverage at $T_{n,\ell}^\circlearrowright=4$.
  • Figure 3: Example run of the continuous problem with $a=0.3$. On round $t$, the arc $I_t=[U_t,U_t+a)\pmod{1}$ is placed; coverage at $\mathcal{T}_{0.3}=5$.
  • Figure 4: Example run of the batch sampling problem with $n=7$ and $\ell=2$; coverage at $T_{\binom{[n]}{\ell}}=7$.
  • Figure 5: Layered Markov chain, representing the case $\ell\ge \frac{n}{3}$.
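
Runs like those in Figures 1 and 2 can be imitated with a short Monte Carlo simulation. The Python sketch below is illustrative only: the helper names are hypothetical, items are labelled $0,\dots,n-1$, and it assumes that in the non-cyclic model the admissible draws are the $n-\ell+1$ windows that fit inside the range, each chosen uniformly. The paper's exact definitions may differ.

```python
import random


def cyclic_windows(n, ell):
    """All n cyclic windows of length ell over items 0..n-1."""
    return [frozenset((s + j) % n for j in range(ell)) for s in range(n)]


def noncyclic_windows(n, ell):
    """The n - ell + 1 contiguous windows inside 0..n-1 (assumed admissible set)."""
    return [frozenset(range(s, s + ell)) for s in range(n - ell + 1)]


def coverage_time(draws, n, rng):
    """Number of uniform draws from `draws` until all n items have been seen."""
    seen = set()
    t = 0
    while len(seen) < n:
        seen |= rng.choice(draws)
        t += 1
    return t


def mean_coverage_time(draws, n, trials=2000, seed=0):
    """Monte Carlo estimate of the expected coverage time."""
    rng = random.Random(seed)
    return sum(coverage_time(draws, n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    # Parameters taken from the captions of Figures 1 and 2.
    print("non-cyclic, n=8, ell=3:", mean_coverage_time(noncyclic_windows(8, 3), 8))
    print("cyclic, n=10, ell=4:", mean_coverage_time(cyclic_windows(10, 4), 10))
```

Setting `ell=1` in `cyclic_windows` reduces the simulation to the classical coupon collector, which gives a quick check of the estimator against $nH_n$.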

Theorems & Definitions (28)

  • Example 1
  • Example 2
  • Example 3
  • Example 4
  • Definition 1
  • Definition 2
  • Example 5
  • Example 6
  • Theorem 1
  • Lemma 1
  • ...and 18 more