Ensemble Methods for Sequence Classification with Hidden Markov Models
Maxime Kawawa-Beaudan, Srijan Sood, Soham Palande, Ganapathy Mani, Tucker Balch, Manuela Veloso
TL;DR
The paper tackles binary sequence classification under severe data imbalance and variable-length sequences by introducing HMM-e, an ensemble of Hidden Markov Models trained on random data subsets. It defines a model-agnostic composite score $s(\mathcal{O})=\sum_{i=1}^N\sum_{j=1}^M \mathbb{1}\{p(\mathcal{O}\mid\lambda_i^+) > p(\mathcal{O}\mid\lambda_j^-)\}$ to compare sequences without direct length normalization and demonstrates that HMM-e improves robustness and performance, even against CNNs and LSTMs on genomics benchmarks. The framework also supports downstream modeling by using HMM-e likelihoods as features for SVMs and NNs, with normalization to address sequence length effects, and it shows that HMM-e generally outperforms baselines, particularly in highly imbalanced data scenarios. Overall, HMM-e offers a practical, scalable approach to sequence classification with strong performance, interpretability, and compatibility with various downstream methods, with future work including ensemble pruning and synthetic data generation.
Abstract
We present a lightweight approach to sequence classification using Ensemble Methods for Hidden Markov Models (HMMs). HMMs offer significant advantages in scenarios with imbalanced or smaller datasets due to their simplicity, interpretability, and efficiency. These models are particularly effective in domains such as finance and biology, where traditional methods struggle with high feature dimensionality and varied sequence lengths. Our ensemble-based scoring method enables the comparison of sequences of any length and improves performance on imbalanced datasets. This study focuses on the binary classification problem, particularly in scenarios with data imbalance, where the negative class is the majority (e.g., normal data) and the positive class is the minority (e.g., anomalous data), often with extreme distribution skews. We propose a novel training approach for HMM Ensembles that generalizes to multi-class problems and supports classification and anomaly detection. Our method fits class-specific groups of diverse models using random data subsets, and compares likelihoods across classes to produce composite scores, achieving high average precisions and AUCs. In addition, we compare our approach with neural network-based methods such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), highlighting the efficiency and robustness of HMMs in data-scarce environments. Motivated by real-world use cases, our method demonstrates robust performance across various benchmarks, offering a flexible framework for diverse applications.
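As a rough illustration of how likelihoods can be compared across classes to produce a composite score, the sketch below counts, for a single sequence $\mathcal{O}$, the pairs $(i, j)$ for which the $i$-th positive-class model assigns a higher likelihood than the $j$-th negative-class model, i.e. $s(\mathcal{O})=\sum_{i=1}^N\sum_{j=1}^M \mathbb{1}\{p(\mathcal{O}\mid\lambda_i^+) > p(\mathcal{O}\mid\lambda_j^-)\}$. The function name `composite_score` and its inputs are assumptions made for this example, not the authors' code. The sketch takes precomputed log-likelihoods rather than fitted HMMs, which is equivalent for the comparison because the logarithm is monotone.

```python
import numpy as np

def composite_score(loglik_pos, loglik_neg):
    """Count pairs (i, j) where positive-class model i beats negative-class model j.

    loglik_pos: log p(O | lambda_i^+) for each of the N positive-class models.
    loglik_neg: log p(O | lambda_j^-) for each of the M negative-class models.
    Both are hypothetical inputs, assumed to be computed elsewhere for one sequence O.
    """
    pos = np.asarray(loglik_pos, dtype=float)
    neg = np.asarray(loglik_neg, dtype=float)
    # Broadcast to an N x M grid of pairwise comparisons, then count the wins.
    return int(np.sum(pos[:, None] > neg[None, :]))

# Toy values for illustration only: 3 positive-class and 2 negative-class models.
score = composite_score([-10.0, -12.0, -8.0], [-11.0, -9.0])
print(score)  # 3 of the 6 pairwise comparisons favor the positive class
```

The score is an integer between 0 and $N \cdot M$. Because it depends only on how the two groups of models rank the same sequence, the raw likelihood magnitudes, which shrink as sequences get longer, do not enter the score directly.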
