Confidence Intervals Using Turing's Estimator: Simulations and Applications

Jie Chang, Michael Grabchak, Jialin Zhang

TL;DR

The paper addresses estimating missing mass and occupancy probabilities with Turing's estimator and constructing reliable confidence intervals (CIs) in finite samples. It develops and compares CI types derived from asymptotic normality ($s_{r,n}\to\infty$) and asymptotic Poissonity ($s_{r,n}\to c$), plus a heuristic CI to select between them, and validates these methods through extensive simulations on discrete uniform, geometric, and discrete Pareto models. A novel CI-based approach for authorship attribution is proposed and applied to Twitter data, demonstrating practical utility in assessing whether two writing samples originate from the same author. The work also provides theoretical results on asymptotic normality and Poissonity for two discrete distributions, with discussions on heavy-tailed behavior and pre-limit phenomena that inform CI performance in practice.

Abstract

Turing's estimator allows one to estimate the probabilities of outcomes that either do not appear or only rarely appear in a given random sample. We perform a simulation study to understand the finite sample performance of several related confidence intervals (CIs) and introduce an approach for selecting the appropriate CI for a given sample. We give an application to the problem of authorship attribution and apply it to a dataset comprised of tweets from users on X (Twitter). Further, we derive several theoretical results about asymptotic normality and asymptotic Poissonity of Turing's estimator for two important discrete distributions.
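As a rough illustration of the estimator described in the abstract, the sketch below computes the classical Good-Turing form: with $N_{r+1}$ denoting the number of distinct outcomes seen exactly $r+1$ times in a sample of size $n$, the total probability of outcomes seen exactly $r$ times is estimated by $(r+1)N_{r+1}/n$, and $r=0$ gives the missing mass estimate $N_1/n$. The function name `turing_estimate` and the choice of denominator $n$ are assumptions made for this example; the paper's exact normalization may differ, and none of its confidence interval constructions are reproduced here.

```python
from collections import Counter


def turing_estimate(sample, r=0):
    """Classical Good-Turing estimate of the total probability of
    outcomes that appear exactly r times in `sample`.

    Hypothetical helper for illustration; it uses the textbook form
    (r + 1) * N_{r+1} / n, which may not match the paper's normalization.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    if r < 0:
        raise ValueError("r must be a non-negative integer")

    # How many times each distinct outcome was observed.
    counts = Counter(sample)
    # freq_of_freq[k] = number of distinct outcomes observed exactly k times.
    freq_of_freq = Counter(counts.values())

    return (r + 1) * freq_of_freq.get(r + 1, 0) / n


if __name__ == "__main__":
    # Toy sample: a appears 2 times, b 3 times, c, d, e once each (n = 8).
    toy = list("aabbbcde")
    print("missing mass estimate (r=0):", turing_estimate(toy, r=0))
    print("occupancy estimate (r=1):", turing_estimate(toy, r=1))
```

In the toy sample three of the eight observations are singletons, so the missing mass estimate is $3/8$. Real applications such as authorship attribution would use word tokens as the sample.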

Paper Structure

This paper contains 9 sections, 6 theorems, 45 equations, 6 figures, 1 table.

Key Result

Proposition 1

Let $a=-1/\log(1-p)$. We have

Figures (6)

  • Figure 1: Results for the Discrete Uniform. Each plot gives the coverage proportions and the mean widths for the three $95\%$ CIs for different choices of $r$ and $K$. These are based on $N=5000$ replications. The sample size on the $x$-axis is presented on a log (base 10) scale. The horizontal dashed line is a reference line at $0.95$.
  • Figure 2: Results for the Dynamic Discrete Uniform. Each plot gives the coverage proportions and the mean widths for the three $95\%$ CIs for different choices of $r$ and $\gamma$. The plots are based on $N=5000$ replications. The sample size on the $x$-axis is presented on a log (base 10) scale. The horizontal dashed line is a reference line at $0.95$.
  • Figure 3: Results for the Geometric Distribution. Each plot gives the coverage proportions and the mean widths for the three $95\%$ CIs. The plots are based on $N=5000$ replications. The first three rows give plots for fixed distributions and the fourth for a dynamic distribution. The sample size on the $x$-axis is presented on a log (base 10) scale. The horizontal dashed line is a reference line at $0.95$.
  • Figure 4: Results for the Discrete Pareto Distribution. Each plot gives the coverage proportion and the mean width of three $95\%$ CIs for different choices of $r$ and $\alpha$. These are based on $N=5000$ replications. The horizontal axis for sample size is displayed on a log (base 10) scale. The horizontal dashed line is a reference line at $0.95$.
  • Figure 5: The CIs are constructed from the Corpus and the detecting points are calculated from the testing sets.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof