To BEE or not to BEE: Estimating more than Entropy with Biased Entropy Estimators
Ilaria Pia la Torre, David A. Kelly, Hector D. Menendez, David Clark
TL;DR
This work addresses the estimation of Shannon measures when the underlying distribution is unknown by empirically comparing 18 biased entropy estimators on entropy $H(X)$, mutual information (MI) $I(X;Y)$, and conditional mutual information (CMI) $I(X;Y|Z)$. It shows that the Chao-Shen ($CS$) and Chao-Wang-Jost ($CW$) estimators converge fastest and yield the strongest accuracy across measures, enabling substantial reductions in data collection effort in software-engineering contexts. The study extends the comparison beyond entropy to MI and CMI, introduces a safe-sample concept $F_p$ and examines its scaling with domain size $k$ (exponential decay for $CS$ and $CW$, linear behavior for the other estimators), and provides practical, distribution-agnostic recommendations with a unified Julia implementation. Together, these findings offer concrete guidance for practitioners performing information-theoretic analyses in software engineering, with broad implications for testing, leakage detection, and feature selection.
Abstract
Entropy estimation plays a significant role in biology, economics, physics, communication engineering and other disciplines. It is increasingly used in software engineering, e.g. in software confidentiality, software testing, predictive analysis, machine learning, and software improvement. However, accurate estimation is demonstrably expensive in many contexts, including software. Statisticians have consequently developed biased estimators that aim to estimate entropy accurately on the basis of a sample. In this paper we apply 18 widely employed entropy estimators to Shannon measures useful to the software engineer: entropy, mutual information and conditional mutual information. Moreover, we investigate how the estimators are affected by two main influential factors: sample size and domain size. Our experiments range over a large set of randomly generated joint probability distributions and varying sample sizes, rather than choosing just one or two well-known probability distributions as in previous investigations. Our most important result is that the Chao-Shen and Chao-Wang-Jost estimators stand out for consistently converging more quickly to the ground truth, regardless of domain size and regardless of the measure used. They also tend to outperform the others in terms of accuracy as sample sizes increase. This discovery enables a significant reduction in data collection effort without compromising performance.
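To illustrate what estimating entropy "on the basis of a sample" involves, the following Python sketch contrasts the plug-in (maximum-likelihood) estimate with the Chao-Shen coverage-adjusted estimate. It is not the paper's Julia implementation: the function names are hypothetical, results are in nats, and lifting an entropy estimator to mutual information through the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$ is an assumption about one possible approach, not a description of the authors' method.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in nats."""
    counts = Counter(samples)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one sample")
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def chao_shen_entropy(samples):
    """Chao-Shen coverage-adjusted entropy estimate, in nats."""
    counts = Counter(samples)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one sample")
    # f1: number of symbols observed exactly once (singletons).
    f1 = sum(1 for c in counts.values() if c == 1)
    if f1 == n:
        # Every symbol is a singleton; nudge f1 so coverage is not zero.
        f1 = n - 1
    coverage = 1.0 - f1 / n  # Good-Turing estimate of sample coverage
    total = 0.0
    for c in counts.values():
        p = coverage * c / n  # coverage-adjusted probability
        # Horvitz-Thompson correction: divide by the probability that the
        # symbol is seen at least once in n draws.
        total -= p * math.log(p) / (1.0 - (1.0 - p) ** n)
    return total


def mutual_information(pairs, estimator):
    """Estimate I(X;Y) as H(X) + H(Y) - H(X,Y) with the given estimator.

    Assumption: the decomposition is used only for illustration here.
    """
    pairs = list(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return estimator(xs) + estimator(ys) - estimator(pairs)
```

Calling `mutual_information(pairs, chao_shen_entropy)` on a list of `(x, y)` observations swaps the estimator without changing the decomposition, which is one simple way to compare estimators on the same sample.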
