
Bayesian taut splines for estimating the number of modes

José E. Chacón, Javier Fernández Serrano

TL;DR

This paper tackles the challenge of estimating the number of modes in univariate densities by introducing Bayesian taut splines (BTS), a framework that blends kernel density estimation with compositional splines in Bayes spaces to yield structured, probabilistic modality inferences. BTS proceeds through four phases (exploration, analysis, selection, and testing), enabling soft, data-driven decisions about the number of modes $k$ while incorporating expert judgment. A one-parameter SFPCA-based model summarizes the PDF ensemble, and excess-mass-based testing via Savage-Dickey Bayes factors assesses the local significance of each mode, producing a holistic view that combines global model structure with local evidence. The approach is demonstrated on the Hidalgo stamp data and MLB pitching speeds, and a comprehensive simulation study shows that BTS often outperforms traditional modality methods while yielding interpretable intermediate results such as mode trees and posterior medians. These contributions offer a practical, interpretable, and robust framework for modality assessment, with potential applicability to PDF estimation for bounded data and to exploratory data analysis.
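
For reference, the Savage-Dickey density ratio mentioned above can be written in its generic textbook form. The symbols $\theta$ (parameter of interest), $\theta_0$ (its null value), and $\psi$ (nuisance parameters) are illustrative assumptions and are not taken from the paper; how BTS maps the excess mass of a mode onto $\theta$ is not stated in this summary. For a point null $H_0\colon \theta = \theta_0$ nested in $H_1$, and provided the prior on $\psi$ under $H_0$ matches the conditional prior $p(\psi \mid \theta_0, H_1)$, the Bayes factor in favour of $H_0$ reduces to the ratio of posterior to prior density at the null value:

$$
\mathrm{BF}_{01} \;=\; \frac{p(\mathcal{D} \mid H_0)}{p(\mathcal{D} \mid H_1)} \;=\; \frac{p(\theta = \theta_0 \mid \mathcal{D}, H_1)}{p(\theta = \theta_0 \mid H_1)} .
$$

Values of $\mathrm{BF}_{01}$ well below 1 indicate that the data have moved posterior mass away from $\theta_0$, that is, evidence against the null.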

Abstract

The number of modes in a probability density function is representative of the complexity of a model and can also be viewed as the number of subpopulations. Despite its relevance, there has been limited research in this area. A novel approach to estimating the number of modes in the univariate setting is presented, focusing on prediction accuracy and inspired by some overlooked aspects of the problem: the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view that blends local and global density properties. The technique combines flexible kernel estimators and parsimonious compositional splines in the Bayesian inference paradigm, providing soft solutions and incorporating expert judgment. The procedure includes feature exploration, model selection, and mode testing, illustrated in a sports analytics case study showcasing multiple companion visualisation tools. A thorough simulation study also demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, the new method emerges as a top-tier alternative, offering innovative solutions for analysts.
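
To make the underlying difficulty concrete, the following minimal sketch shows how the mode count of a kernel density estimate depends on the bandwidth. It is not the BTS procedure: it assumes a fixed-bandwidth Gaussian kernel evaluated on a grid, a synthetic two-cluster sample, and hypothetical helper names (`gaussian_kde_on_grid`, `count_modes`) introduced only for illustration.

```python
import numpy as np


def gaussian_kde_on_grid(sample, bandwidth, grid):
    """Evaluate a fixed-bandwidth Gaussian KDE at every grid point."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (sample.size * bandwidth)


def count_modes(density):
    """Count local maxima of gridded density values (a rise followed by a non-rise)."""
    steps = np.diff(density)
    return int(np.sum((steps[:-1] > 0) & (steps[1:] <= 0)))


# Synthetic, deterministic two-cluster sample (illustrative only, not the paper's data).
sample = np.concatenate([np.linspace(-3.0, -1.0, 50), np.linspace(1.0, 3.0, 50)])
grid = np.linspace(-6.0, 6.0, 1201)

# Number of modes found for a few hypothetical bandwidth choices.
modes_by_bandwidth = {
    h: count_modes(gaussian_kde_on_grid(sample, h, grid)) for h in (0.5, 1.0, 3.0)
}
print(modes_by_bandwidth)
```

A moderate bandwidth keeps the two clusters apart, whereas a sufficiently large one merges them into a single mode. The answer to "how many modes?" therefore depends on a smoothing choice, which is part of the motivation for treating the number of modes as uncertain.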

Paper Structure (55 sections, 24 equations, 25 figures, 5 tables)


Figures (25)

  • Figure 1: The Hidalgo stamps data [Izenman1988]. The bar chart on the left displays the sample, which comprises 485 measurements of stamp thickness in hundredths of a millimetre. Several PDFs for that sample are shown on the right. There are some KDEs with different bandwidth selectors, such as PI, LSCV and STE, the first two of which have variants targeting the $r$-th PDF derivative [Chacon2013]: $\mathrm{PI}_{r}$ and $\mathrm{LSCV}_{r}$. Namely, the PDFs are $\mathrm{PI}_{0}$ (black, 7 modes), $\mathrm{PI}_{1}$ (cyan, 5 modes), $\mathrm{PI}_{2}$ (dark blue, 2 modes), STE (orange, 9 modes), $\mathrm{LSCV}_{0}$ (red, 11 modes), and a Gaussian mixture (green, 3 modes). The bottom picture shows our BTS solution with seven modes based on 32 spline basis functions.
  • Figure 2: An illustrative test-bed for estimating the number of modes. The M25 mixture model PDF from [AmeijeirasAlonso2018] is shown on the left. A histogram of a random sample of size 200 from that model is displayed on the right.
  • Figure 3: MCMC sample from $\mathrm{Pr}(h, \alpha \mid \mathcal{D})$ consisting of 700 observations. The horizontal and vertical axes represent the $h$ and $\alpha$ components, respectively. The points are coloured according to the number of modes of the KDE-spline model: red (57% of the total, 2 modes) and blue (43% of the total, 3 modes). The shape of each point represents the number of modes of the underlying KDE, thus only depending on $h$: square (2 modes), circle (3 modes) and triangle (4 modes), among others. The average point is the black star in the middle of the point cloud. KDEs for the margins are also provided.
  • Figure 4: SFPCA analysis phase results. The scree plot of the ordered PCs against their variances is presented on the left. Some representative PDFs in the SFPCA model are displayed on the right: the mean $\mu$ (black, 3 modes), the lower bound $\mu \oplus \delta_{\min} \odot \sigma$ (blue, 2 modes) and the upper bound $\mu \oplus \delta_{\max} \odot \sigma$ (red, 3 modes).
  • Figure 5: Second Bayesian inference on the SFPCA model. The Jeffreys prior PDF is shown on the left. A histogram of a sample from $\mathrm{Pr}(\delta \mid \mathcal{D})$ consisting of 7,840 observations is shown on the right.
  • ...and 20 more figures