Closed-Form Beta Distribution Estimation from Sparse Statistics with Random Forest Implicit Regularization

Jonathan R. Landers

TL;DR

The paper tackles the recovery of distributional information from sparse statistics to improve time-series classification. It introduces a closed-form estimator for a scaled beta distribution with parameters $(\alpha,\beta)$, obtained by composite quantile and moment matching from the summaries $(\min,\max,\mu,\tilde{\mu})$, and demonstrates that including these parameters as features improves pairwise Random Forest classification accuracy on seat-price trajectories. The paper establishes an accuracy–fidelity bridge via total-variation distance $\mathrm{TV}$ and Jensen-Shannon divergence $\mathrm{JS}$, showing that $\mathrm{JS}$ converges quadratically in the small-error regime, which justifies using classifier performance as a proxy for distributional fidelity. It also proposes an implicit-regularization mechanism: adding zero-variance features reshapes split-selection probabilities, increasing tree depth and diversity and reducing inter-tree correlation, with consistent gains on SeatGeek pricing data and the UCI handwritten digits dataset. The work provides a practical, scalable route from sparse distributional snapshots to closed-form estimation and improved ensemble accuracy, with applicability to data-scarce, real-time decision contexts.

Abstract

This work advances distribution recovery from sparse data and ensemble classification through three main contributions. First, we introduce a closed-form estimator that reconstructs scaled beta distributions from limited statistics (minimum, maximum, mean, and median) via composite quantile and moment matching. The recovered parameters $(\alpha,\beta)$, when used as features in Random Forest classifiers, improve pairwise classification on time-series snapshots, validating the fidelity of the recovered distributions. Second, we establish a link between classification accuracy and distributional closeness by deriving error bounds that constrain total variation distance and Jensen-Shannon divergence, the latter exhibiting quadratic convergence. Third, we show that zero-variance features act as an implicit regularizer, increasing selection probability for mid-ranked predictors and producing deeper, more varied trees. A SeatGeek pricing dataset serves as the primary application, illustrating distributional recovery and event-level classification while situating these methods within the structure and dynamics of the secondary ticket marketplace. The UCI handwritten digits dataset confirms the broader regularization effect. Overall, the study outlines a practical route from sparse distributional snapshots to closed-form estimation and improved ensemble accuracy, with reliability enhanced through implicit regularization.
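
To make the idea of matching a mean and a median in closed form concrete, here is a minimal sketch. It is not the paper's estimator, whose exact formula is not reproduced on this page. It assumes the well-known median approximation $\tilde{\mu} \approx (\alpha - 1/3)/(\alpha + \beta - 2/3)$, which is reasonable for $\alpha, \beta > 1$, together with the exact mean $\mu = \alpha/(\alpha+\beta)$ on the unit interval. The function name `beta_from_summaries` and its argument names are hypothetical.

```python
def beta_from_summaries(lo, hi, mean, median):
    """Approximate (alpha, beta) of a beta distribution scaled to [lo, hi].

    Illustrative sketch only: combines the exact mean formula with an
    assumed median approximation that is reasonable for alpha, beta > 1.
    """
    if not hi > lo:
        raise ValueError("need hi > lo")
    # Rescale the mean and the median to the unit interval.
    m = (mean - lo) / (hi - lo)
    q = (median - lo) / (hi - lo)
    if not (0.0 < m < 1.0 and 0.0 < q < 1.0):
        raise ValueError("mean and median must lie strictly inside (lo, hi)")
    if abs(q - m) < 1e-12:
        # A symmetric summary leaves alpha + beta undetermined here.
        raise ValueError("mean == median: alpha + beta is not identifiable")
    # Moment match:   m = alpha / (alpha + beta)
    # Quantile match: q ~= (alpha - 1/3) / (alpha + beta - 2/3)  (assumption)
    # Substituting alpha = m * s with s = alpha + beta and solving for s:
    s = (2.0 * q - 1.0) / (3.0 * (q - m))
    if s <= 0.0:
        raise ValueError("summaries are inconsistent with the approximation")
    return m * s, (1.0 - m) * s


# Round-trip check: these inputs were built from alpha=2, beta=5 using the
# same approximation, so the sketch should return values close to (2, 5).
alpha, beta = beta_from_summaries(
    lo=10.0, hi=110.0,
    mean=10.0 + 100.0 * 2 / 7,
    median=10.0 + 100.0 * 5 / 19,
)
print(alpha, beta)
```

The two equations are linear in $\alpha + \beta$ once $\alpha$ is eliminated, which is what makes a closed form possible under this approximation. The sketch fails loudly when the mean and median coincide, since the summaries then carry no information about concentration.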

Paper Structure

This paper contains 24 sections, 8 theorems, 95 equations, 12 figures, 2 tables.

Key Result

Proposition 4.1

Let $\Theta \subset \mathbb{R}^d$ be the space of parameters, where each probability distribution $P$ is parameterized by $\theta \in \Theta$. Define a feature map where $f_i(P)$ represents summary statistics of $P$, such as $\text{Min}_i$, $\text{Max}_i$, $\tilde{\mu}_i$, and $\mu_i$. A classifier $f : \mathbb{R}^k \to \{0,1\}$ is trained to distinguish between two classes based on $\phi(\hat{\t

Figures (12)

  • Figure 1: Event Overview, Buddy Guy at Wilbur Theatre, Boston, MA, 10/3/2023
  • Figure 2: The plots show the distributions of each feature across all events for artists Drake and Olivia Rodrigo. The Hellinger distance and Jensen-Shannon divergence are calculated between each distribution. In this particular comparison of artists, the $\alpha_i$ parameter offers the most distinctive density profile across all events, as indicated by the distribution distance metrics.
  • Figure 3: Random Forest performance comparison using $\mathcal{D}_{\text{basic}}$ vs. $\mathcal{D}_{\alpha \beta}$ features.
  • Figure 4: Statistical vs. distributional pricing representations for the Ed Sheeran concert on 6/29/2023 at Boch Center Wang Theatre.
  • Figure 5: Distributional divergence analysis. (a) shows reconstructed vs. true scaled beta densities sorted by divergence; (b) plots Jensen-Shannon divergence against total variation with theoretical bounds and a mapped loss axis. Together these illustrate how reconstruction fidelity relates to classification-relevant divergence scales.
  • ...and 7 more figures
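
As a rough illustration of how Jensen-Shannon divergence and total variation relate (the comparison described in the Figure 5 caption), the sketch below computes both for two discrete distributions. It checks the standard textbook bounds $\mathrm{TV}^2/2 \le \mathrm{JS} \le \ln 2 \cdot \mathrm{TV}$ (in nats), which follow from Pinsker's inequality and from Lin's upper bound. These are assumed here for illustration and are not necessarily the exact bounds of Theorems 4.1 and 4.2. The example vectors `p` and `q` are made up.

```python
import math


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence in nats; terms with p_i = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in nats, via the mixture m = (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

tv = total_variation(p, q)
js = jensen_shannon(p, q)

# Standard bounds: quadratic in TV from below, linear in TV from above.
lower, upper = tv ** 2 / 2, math.log(2) * tv
print(f"TV={tv:.4f}  JS={js:.6f}  bounds=[{lower:.6f}, {upper:.6f}]")
```

The quadratic lower bound is what makes JS shrink much faster than TV when the two distributions are already close.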

Theorems & Definitions (15)

  • Proposition 4.1: Parameter Estimation Consistency via Classification Accuracy
  • proof
  • Theorem 4.1: Classification Accuracy and Total Variation Distance
  • proof
  • Theorem 4.2: Classification Accuracy and Jensen-Shannon Divergence
  • proof
  • Theorem 5.1: Zero-Variance Dilution Effect
  • proof
  • Corollary 5.1: Increased Expected Tree Depth
  • proof
  • ...and 5 more
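
The zero-variance dilution effect named in Theorem 5.1 can be illustrated with a small combinatorial sketch. It is a simplified model, not the paper's proof. It assumes informative features are strictly ranked by split quality, that each node draws a fixed number `mtry` of candidate features uniformly without replacement, and that a constant feature can never win a split. Real Random Forest implementations may differ, for instance by continuing to draw features until a valid split is found. The function name `split_probability` and all parameter names are hypothetical.

```python
from math import comb


def split_probability(rank, n_informative, n_constant, mtry):
    """P(the informative feature with this quality rank wins the split).

    rank=1 is the best informative feature. The feature wins when it is
    drawn and none of the rank-1 better features are drawn, so the other
    mtry-1 candidates must come from the n - rank features below it
    (constant features count as "below" because they can never win).
    """
    n = n_informative + n_constant
    return comb(n - rank, mtry - 1) / comb(n, mtry)


# Hypothetical setting: 10 informative features, 3 candidates per split.
for n_constant in (0, 10):
    top = split_probability(1, 10, n_constant, 3)
    mid = split_probability(5, 10, n_constant, 3)
    print(f"constant features={n_constant:2d}  "
          f"P(rank 1)={top:.4f}  P(rank 5)={mid:.4f}")
```

Under these assumptions, adding ten constant features halves the chance that the top-ranked feature wins a split while slightly raising the chance for the fifth-ranked one, which matches the qualitative claim that mid-ranked predictors are selected more often.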