Confidence Intervals Using Turing's Estimator: Simulations and Applications
Jie Chang, Michael Grabchak, Jialin Zhang
TL;DR
The paper studies how to estimate the missing mass and occupancy probabilities with Turing's estimator and how to construct reliable confidence intervals (CIs) in finite samples. It develops and compares CIs derived from asymptotic normality ($s_{r,n}\to\infty$) and asymptotic Poissonity ($s_{r,n}\to c$), together with a heuristic for choosing between them for a given sample, and evaluates these methods through extensive simulations on discrete uniform, geometric, and discrete Pareto models. A CI-based approach to authorship attribution is proposed and applied to Twitter data to assess whether two writing samples originate from the same author. The work also provides theoretical results on asymptotic normality and Poissonity for two discrete distributions, with discussions of heavy-tailed behavior and pre-limit phenomena that inform CI performance in practice.
Abstract
Turing's estimator allows one to estimate the probabilities of outcomes that either do not appear or only rarely appear in a given random sample. We perform a simulation study to understand the finite sample performance of several related confidence intervals (CIs) and introduce an approach for selecting the appropriate CI for a given sample. We give an application to the problem of authorship attribution and apply it to a dataset consisting of tweets from users on X (Twitter). Further, we derive several theoretical results about asymptotic normality and asymptotic Poissonity of Turing's estimator for two important discrete distributions.
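To make the quantity being estimated concrete, here is a minimal sketch of the point estimate only, assuming the standard Good-Turing form: the total probability of outcomes seen exactly $r$ times is estimated by $(r+1)N_{r+1}/n$, where $N_{r+1}$ is the number of distinct outcomes observed exactly $r+1$ times in a sample of size $n$. The case $r=0$ gives the missing mass estimate $N_1/n$. The function name `turing_estimate` is hypothetical, and the sketch does not implement any of the paper's confidence intervals.

```python
from collections import Counter


def turing_estimate(sample, r=0):
    """Estimate the total probability of outcomes seen exactly r times.

    Uses the standard Good-Turing form (r + 1) * N_{r+1} / n, which is
    an assumption of this sketch; r = 0 gives the missing mass estimate.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    if r < 0:
        raise ValueError("r must be a non-negative integer")

    # How many times each distinct outcome was observed.
    counts = Counter(sample)
    # N_k: number of distinct outcomes observed exactly k times.
    freq_of_freq = Counter(counts.values())

    return (r + 1) * freq_of_freq.get(r + 1, 0) / n


# Toy sample: a x2, b x3, c, d, e x1 each, so n = 8, N_1 = 3, N_2 = 1.
toy = list("aabbbcde")
missing_mass = turing_estimate(toy, r=0)   # 3 / 8 = 0.375
seen_once_mass = turing_estimate(toy, r=1)  # 2 * 1 / 8 = 0.25
```

In the toy sample, three of the eight observations are singletons, so the estimated probability that the next observation is a previously unseen outcome is 0.375.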
