Using Chao's Estimator as a Stopping Criterion for Technology-Assisted Review
Michiel P. Bron, Peter G. M. van der Heijden, Ad J. Feelders, Arno P. J. M. Siebes
TL;DR
This paper introduces a stopping criterion for Technology-Assisted Review based on Population Size Estimation using Chao's estimator to bound the total number of relevant documents $|\text{D}^+|$. It integrates two versions of Chao's estimator (Chao 1987 and Chao Rivest) within an ensemble Active Learning TAR framework that allows single- and multi-user document ranking, sampling, and decision-making. Through extensive simulations on diverse TAR datasets, the authors compare estimator-based stopping with existing criteria, showing that the Rivest variant often yields superior recall and work savings, while the conservative Chao 1987 approach provides robust reliability. The work demonstrates that PSE-based stopping can offer formal stopping guarantees and informative recall estimates, potentially improving decision support for reviewers in large-scale literature searches. Practical impact includes more reliable stopping decisions in systematic reviews and related text screening tasks, with clear trade-offs between recall guarantee and reader workload.
Abstract
Technology-Assisted Review (TAR) aims to reduce the human effort required for screening processes such as abstract screening for systematic literature reviews. Human reviewers label documents as relevant or irrelevant during this process, while the system incrementally updates a prediction model based on the reviewers' previous decisions. After each model update, the system proposes new documents it deems relevant, to prioritize relevant documents over irrelevant ones. A stopping criterion is necessary to guide users in stopping the review process, so as to minimize both the number of missed relevant documents and the number of irrelevant documents read. In this paper, we propose and evaluate a new ensemble-based Active Learning strategy and a stopping criterion based on Chao's Population Size Estimator that estimates the prevalence of relevant documents in the dataset. Our simulation study compares this criterion with other methods presented in the literature and demonstrates that it performs well on several datasets.
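To make the idea of a population-size-based stopping rule concrete, the following is a minimal sketch of the classical Chao (1987) lower-bound estimator, assuming each relevant document found so far is annotated with the number of ensemble members that retrieved it. The function names, the `target_recall` parameter, and the simple point-estimate stopping rule are illustrative assumptions, not the authors' implementation; the paper's criterion may rely on interval estimates or on the Rivest variant instead.

```python
from collections import Counter


def chao1987_estimate(capture_counts):
    """Chao (1987) lower-bound estimate of the total number of relevant documents.

    capture_counts: one integer per relevant document found so far, giving the
    number of ensemble members that retrieved it (hypothetical input format).
    """
    counts = [c for c in capture_counts if c > 0]
    n = len(counts)            # relevant documents observed so far
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]  # documents found by exactly one / exactly two members
    if f2 > 0:
        return n + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when no document was found exactly twice.
    return n + f1 * (f1 - 1) / 2


def should_stop(capture_counts, target_recall=0.95):
    """Illustrative rule (an assumption): stop once estimated recall reaches the target."""
    found = sum(1 for c in capture_counts if c > 0)
    estimate = chao1987_estimate(capture_counts)
    if estimate == 0:
        return False  # nothing found yet, so there is no basis for stopping
    return found / estimate >= target_recall


if __name__ == "__main__":
    counts = [1, 1, 1, 1, 2, 2, 3]
    print(chao1987_estimate(counts))  # 7 + 4**2 / (2 * 2) = 11.0
    print(should_stop(counts))        # estimated recall 7/11 is below 0.95
```

Many documents found by only one ensemble member (large `f1`) push the estimate up, signalling that relevant documents likely remain unseen; as the review progresses and most relevant documents are found by several members, the estimate approaches the observed count and the rule permits stopping.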
