Closed-Form Beta Distribution Estimation from Sparse Statistics with Random Forest Implicit Regularization
Jonathan R. Landers
TL;DR
The paper tackles the recovery of distributional information from sparse statistics to improve time-series classification. It introduces a closed-form estimator for the parameters $(\alpha,\beta)$ of a scaled beta distribution, using composite quantile and moment matching on $(\min,\max,\mu,\tilde{\mu})$ summaries (minimum, maximum, mean, and median), and demonstrates that including these parameters as features boosts pairwise Random Forest classification accuracy on seat-price trajectories. The paper establishes an accuracy–fidelity bridge via total-variation distance $\mathrm{TV}$ and Jensen-Shannon divergence $\mathrm{JS}$, showing that $\mathrm{JS}$ converges quadratically in the small-error regime, which justifies using classifier performance as a proxy for distributional fidelity. It also proposes a novel implicit-regularization mechanism: adding zero-variance features reshapes split-selection probabilities, increasing tree depth and diversity and reducing inter-tree correlation, with consistent gains on SeatGeek pricing data and the UCI handwritten digits dataset. The work provides a practical, scalable route from sparse distributional snapshots to closed-form estimation and improved ensemble accuracy, with broad applicability to data-scarce, real-time decision contexts.
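To make the accuracy–fidelity link concrete, the following minimal sketch compares $\mathrm{TV}$ and $\mathrm{JS}$ for two discrete distributions that differ by a small perturbation of size `eps`. It is an illustration under stated assumptions, not the paper's derivation: the distributions `base` and `direction`, the helper names, and the use of the standard equal-prior identity (best achievable two-class accuracy equals $(1+\mathrm{TV})/2$) are all choices made here. Halving `eps` halves $\mathrm{TV}$ but cuts $\mathrm{JS}$ by roughly a factor of four, which is the quadratic small-error behavior.

```python
import math


def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence in nats; terms with p_i = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in nats, using the midpoint mixture."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def bayes_accuracy(p, q):
    """Best achievable accuracy when telling p from q with equal priors."""
    return 0.5 * (1.0 + total_variation(p, q))


def perturb(p, direction, eps):
    """Shift p by eps along a direction whose entries sum to zero."""
    return [a + eps * d for a, d in zip(p, direction)]


if __name__ == "__main__":
    # Hypothetical example distribution and perturbation direction.
    base = [0.4, 0.3, 0.2, 0.1]
    direction = [1.0, -1.0, 1.0, -1.0]
    for eps in (0.08, 0.04, 0.02, 0.01):
        q = perturb(base, direction, eps)
        tv = total_variation(base, q)
        js = jensen_shannon(base, q)
        print(f"eps={eps:.2f}  TV={tv:.4f}  JS={js:.6f}  "
              f"JS/eps^2={js / eps**2:.3f}  acc={bayes_accuracy(base, q):.4f}")
```

In this toy setting the printed `JS/eps^2` column stays nearly constant while `TV` scales linearly with `eps`, so a classifier accuracy close to chance corresponds to a much smaller $\mathrm{JS}$ than $\mathrm{TV}$.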
Abstract
This work advances distribution recovery from sparse data and ensemble classification through three main contributions. First, we introduce a closed-form estimator that reconstructs scaled beta distributions from limited statistics (minimum, maximum, mean, and median) via composite quantile and moment matching. The recovered parameters $(\alpha,\beta)$, when used as features in Random Forest classifiers, improve pairwise classification on time-series snapshots, validating the fidelity of the recovered distributions. Second, we establish a link between classification accuracy and distributional closeness by deriving error bounds that constrain total variation distance and Jensen-Shannon divergence, the latter exhibiting quadratic convergence. Third, we show that zero-variance features act as an implicit regularizer, increasing selection probability for mid-ranked predictors and producing deeper, more varied trees. A SeatGeek pricing dataset serves as the primary application, illustrating distributional recovery and event-level classification while situating these methods within the structure and dynamics of the secondary ticket marketplace. The UCI handwritten digits dataset confirms the broader regularization effect. Overall, the study outlines a practical route from sparse distributional snapshots to closed-form estimation and improved ensemble accuracy, with reliability enhanced through implicit regularization.
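As an illustration of how four summary statistics can pin down a scaled beta distribution in closed form, the sketch below combines mean matching with a median approximation. It should not be read as the estimator derived in this paper: the function name `beta_from_summary` is hypothetical, and the sketch assumes the well-known approximation $\text{median} \approx (\alpha - 1/3)/(\alpha + \beta - 2/3)$, which is intended for $\alpha,\beta > 1$. With $m$ and $q$ denoting the mean and median rescaled to $[0,1]$, the two conditions $m = \alpha/(\alpha+\beta)$ and $q = (\alpha - 1/3)/(\alpha+\beta-2/3)$ give $s = \alpha+\beta = (2q-1)/\bigl(3(q-m)\bigr)$, then $\alpha = ms$ and $\beta = (1-m)s$.

```python
def beta_from_summary(lo, hi, mean, median):
    """Sketch: recover (alpha, beta) of a beta distribution scaled to [lo, hi].

    Assumes median ~= (alpha - 1/3) / (alpha + beta - 2/3), an approximation
    meant for alpha, beta > 1. This is an illustrative stand-in, not the
    paper's own estimator.
    """
    if not lo < hi:
        raise ValueError("require lo < hi")
    width = hi - lo
    m = (mean - lo) / width      # mean rescaled to [0, 1]
    q = (median - lo) / width    # median rescaled to [0, 1]
    if not (0.0 < m < 1.0 and 0.0 < q < 1.0):
        raise ValueError("mean and median must lie strictly inside (lo, hi)")
    if m == q:
        # Symmetric case: the two equations no longer identify alpha + beta.
        raise ValueError("mean equals median; concentration is not identified")
    s = (2.0 * q - 1.0) / (3.0 * (q - m))  # s = alpha + beta
    if s <= 0.0:
        raise ValueError("summaries are inconsistent with the approximation")
    return m * s, (1.0 - m) * s


if __name__ == "__main__":
    # Hypothetical price snapshot: min, max, mean, median of listed prices.
    alpha, beta = beta_from_summary(lo=50.0, hi=400.0, mean=150.0, median=142.0)
    print(f"alpha={alpha:.3f}, beta={beta:.3f}")
```

The resulting pair can be appended to a feature vector alongside the raw summaries. Because the median approximation degrades when either parameter is at or below 1, estimates in that range deserve extra scrutiny.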
