Beta-CoRM: A Bayesian Approach for $n$-gram Profiles Analysis
José A. Perusquía, Jim E. Griffin, Cristiano Villa
TL;DR
This paper tackles the challenge of probabilistically modeling binary $n$-gram profiles for grouped data, addressing high dimensionality via a Bayesian nonparametric approach. It introduces beta-CoRM, a model based on compound random measures in which a beta process directs the shared features and group-specific perturbations, and extends it with a feature-selection mechanism through per-feature score parameters. The authors develop a slice-sampling-based posterior inference scheme, derive full conditionals for key parameters, and conduct prior-sensitivity analyses. Using malware data, they demonstrate how feature selection improves discrimination and show competitive performance against standard classifiers, highlighting the method’s practical value for cyber-security and other domains requiring interpretable, probabilistic modeling of high-dimensional binary features.
Abstract
$n$-gram profiles have been widely and successfully used to analyse long sequences of potentially differing lengths for clustering or classification. Machine learning algorithms have mainly been used for this purpose but, despite their predictive performance, these methods cannot discover hidden structures or provide a full probabilistic representation of the data. To address this, a novel class of Bayesian generative models is proposed for $n$-gram profiles used as binary attributes. The flexibility of the proposed modelling allows a straightforward approach to feature selection within the generative model. Furthermore, a slice sampling algorithm is derived for fast inference, and applications to synthetic and real data show that feature selection can improve classification accuracy.
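To make the notion of $n$-gram profiles used as binary attributes concrete, the following minimal Python sketch turns sequences of differing lengths into 0/1 vectors over a shared $n$-gram vocabulary. It is an illustration only, not the authors' preprocessing pipeline: the function names `ngrams` and `binary_profiles` and the toy sequences are hypothetical, and the feature records only whether an $n$-gram is present, not how often it occurs.

```python
def ngrams(seq, n):
    """Return the set of distinct contiguous n-grams in seq (hypothetical helper)."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def binary_profiles(sequences, n):
    """Map each sequence to a 0/1 vector over the shared n-gram vocabulary."""
    per_seq = [ngrams(s, n) for s in sequences]
    # The vocabulary is every n-gram seen in at least one sequence, in sorted order.
    vocab = sorted(set().union(*per_seq))
    # Entry is 1 when the n-gram occurs in the sequence, 0 otherwise.
    matrix = [[1 if g in grams else 0 for g in vocab] for grams in per_seq]
    return vocab, matrix


# Toy sequences of different lengths (assumed data, for illustration only).
sequences = ["abcab", "bcabca", "aab"]
vocab, X = binary_profiles(sequences, 2)

for seq, row in zip(sequences, X):
    print(seq, row)
```

Each row of `X` has the same length regardless of the length of the original sequence, which is what allows sequences of differing lengths to be compared. The number of columns grows quickly with $n$ and with the alphabet size, which is the high-dimensionality issue that motivates feature selection.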
