To BEE or not to BEE: Estimating more than Entropy with Biased Entropy Estimators

Ilaria Pia la Torre, David A. Kelly, Hector D. Menendez, David Clark

TL;DR

This work addresses estimating Shannon measures when the underlying distribution is unknown by empirically comparing 18 biased entropy estimators on $H(X)$, $I(X;Y)$, and $I(X;Y|Z)$. It demonstrates that the Chao-Shen ($CS$) and Chao-Wang-Jost ($CW$) estimators converge fastest and yield the strongest accuracy across measures, enabling substantial data-effort reductions in software-engineering contexts. The study extends to MI and CMI, introduces a safe-sample concept $F_p$ and its scaling with domain size $k$ (exhibiting exponential decay for $CS/CW$ and linear behavior for others), and provides practical, distribution-agnostic recommendations with a unified Julia implementation. Together, these findings offer concrete guidance for practitioners performing information-theoretic analyses in software engineering, with broad implications for testing, leakage detection, and feature selection.
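For reference, the three measures are linked by standard identities, which is what allows an entropy estimator to be reused for mutual information (MI) and conditional mutual information (CMI). Whether the study combines its estimates exactly this way is an assumption here; the identities themselves are textbook results:

$$
\begin{aligned}
I(X;Y) &= H(X) + H(Y) - H(X,Y) \\
I(X;Y|Z) &= H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
\end{aligned}
$$

Each term on the right is an entropy over a (joint) domain, so the domain size $k$ that matters for estimation is the size of the largest joint space involved.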

Abstract

Entropy estimation plays a significant role in biology, economics, physics, communication engineering and other disciplines. It is increasingly used in software engineering, e.g., in software confidentiality, software testing, predictive analysis, machine learning, and software improvement. However, accurate estimation is demonstrably expensive in many contexts, including software. Statisticians have consequently developed biased estimators that aim to estimate entropy accurately from a sample. In this paper we apply 18 widely employed entropy estimators to Shannon measures useful to the software engineer: entropy, mutual information and conditional mutual information. Moreover, we investigate how the estimators are affected by two main factors: sample size and domain size. Our experiments range over a large set of randomly generated joint probability distributions and varying sample sizes, rather than relying on just one or two well-known probability distributions as in previous investigations. Our most important result is that the Chao-Shen and Chao-Wang-Jost estimators stand out for consistently converging more quickly to the ground truth, regardless of domain size and regardless of the measure used. They also tend to outperform the others in terms of accuracy as sample sizes increase. This discovery enables a significant reduction in data collection effort without compromising performance.
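To make the idea of a bias-corrected estimator concrete, the following is a minimal Python sketch of the naive plug-in (maximum-likelihood) entropy next to the Chao-Shen estimator as it is commonly stated in the literature. It is not the authors' Julia implementation; the function names and the handling of the all-singletons case are assumptions made for illustration. Results are in nats.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Naive plug-in (maximum-likelihood) entropy estimate, in nats."""
    counts = Counter(samples)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one sample")
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def chao_shen_entropy(samples):
    """Chao-Shen entropy estimate, in nats (illustrative sketch).

    Combines a Good-Turing coverage correction with a
    Horvitz-Thompson weighting of the observed symbols.
    """
    counts = Counter(samples)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one sample")

    # f1 = number of symbols observed exactly once (singletons).
    f1 = sum(1 for c in counts.values() if c == 1)
    if f1 == n:
        # Assumption: avoid zero estimated coverage when every
        # observation is a singleton, as some implementations do.
        f1 = n - 1
    coverage = 1.0 - f1 / n

    h = 0.0
    for c in counts.values():
        p = coverage * c / n  # coverage-adjusted probability
        # Divide by the probability that this symbol appears at
        # least once in n draws (Horvitz-Thompson weighting).
        h -= p * math.log(p) / (1.0 - (1.0 - p) ** n)
    return h
```

On a small sample the correction pushes the estimate upward relative to the plug-in value, counteracting the plug-in estimator's known tendency to underestimate entropy when the sample is small compared with the domain.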

Paper Structure (20 sections, 6 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 2: Mean squared error of the estimations for the MI. The fastest-converging methods are highlighted with bold colored lines: CS-CW (purple) and GSB88-SHU (green). Only two values of $k$ are selected to illustrate the performance in boundary scenarios: $k=256$, column (a), and $k=65536$, column (b). Estimators exhibiting outlier behaviour (B-PYM) are shown with dashed red-orange lines. Both sets are consistent across all three Shannon metrics, indicating uniform performance patterns. For clear visual interpretation, the ANSB estimator is not displayed due to its exponential errors.
  • Figure 3: Additional plots for entropy (row 1) and CMI estimations (row 2), with $k=256$ (column 1) and $k=65536$ (column 2).