Beta-CoRM: A Bayesian Approach for $n$-gram Profiles Analysis

José A. Perusquía, Jim E. Griffin, Cristiano Villa

TL;DR

This paper tackles the challenge of probabilistically modeling binary $n$-gram profiles for grouped data, addressing high dimensionality via a Bayesian nonparametric approach. It introduces beta-CoRM, a compound random measure-based model in which a beta process directs shared features and group-specific perturbations, and extends it with a feature-selection mechanism through per-feature score parameters. The authors develop a slice-sampling-based posterior inference scheme, derive full conditionals for key parameters, and conduct prior-sensitivity analyses. Using malware data, they demonstrate how feature selection improves discrimination and show competitive performance against standard classifiers, highlighting the method’s practical relevance for cyber-security and other domains requiring interpretable, probabilistic modeling of high-dimensional binary features.

Abstract

$n$-gram profiles have been successfully and widely used to analyse long sequences of potentially differing lengths for clustering or classification. Machine learning algorithms have mainly been used for this purpose but, despite their predictive performance, these methods cannot discover hidden structures or provide a full probabilistic representation of the data. A novel class of Bayesian generative models for $n$-gram profiles used as binary attributes is proposed to address this. The flexibility of the proposed modelling allows a straightforward approach to feature selection within the generative model. Furthermore, a slice sampling algorithm is derived for fast inference; it is applied to synthetic and real data scenarios, and the results show that feature selection can improve classification accuracy.
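
To make the phrase "$n$-gram profiles used as binary attributes" concrete, the following is a minimal sketch of how sequences of differing lengths could be turned into a binary feature matrix. It is not taken from the paper: the helper names `ngrams` and `binary_profiles` and the toy hexadecimal sequences are assumptions made purely for illustration.

```python
def ngrams(seq, n):
    """Return the set of contiguous n-grams (as tuples) found in a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def binary_profiles(sequences, n):
    """Map each sequence to a 0/1 vector over all n-grams seen in any sequence.

    Hypothetical helper: entry (j, k) is 1 if the k-th n-gram of the shared
    vocabulary occurs at least once in the j-th sequence, and 0 otherwise.
    """
    per_seq = [ngrams(s, n) for s in sequences]
    vocab = sorted(set().union(*per_seq))
    matrix = [[1 if g in grams else 0 for g in vocab] for grams in per_seq]
    return vocab, matrix


# Toy hexadecimal-like sequences of different lengths (made up for illustration).
sequences = [
    ["4d", "5a", "90", "00"],
    ["4d", "5a", "00"],
    ["90", "00", "4d", "5a", "90"],
]

vocab, X = binary_profiles(sequences, n=2)
for row in X:
    print(row)
```

Each row of `X` has the same length regardless of the length of the original sequence, which is what allows observations to be compared feature by feature. With realistic data the vocabulary grows quickly with $n$, which is the high-dimensionality issue the abstract refers to.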

Paper Structure

This paper contains 22 sections, 4 theorems, 46 equations, 7 figures, 12 tables.

Key Result

Proposition 1

Let B be a beta process with discrete base measure as in BP and B0 respectively. Then

Figures (7)

  • Figure 1: Cumulative distribution function of the slab distribution for different values of the score parameter $a$ and $x_0=.00001$.
  • Figure 2: Synthetic data set composed of 5 imbalanced groups separated by the red lines with 250 total observations and 300 binary features.
  • Figure 3: Posterior mean estimates of the score parameters $a_i$'s for the generalised beta-CoRM with different hyperpriors.
  • Figure 4: Malware data set split into training set (left) and test set (right). For each plot the nine families are separated by the red horizontal lines. The black dots represent that a feature is present within the respective malware hexadecimal code.
  • Figure 5: Posterior mean estimates of the score parameters $a_i$'s for the generalised beta-CoRM with different hyperpriors.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 1