Ensemble Methods for Sequence Classification with Hidden Markov Models

Maxime Kawawa-Beaudan, Srijan Sood, Soham Palande, Ganapathy Mani, Tucker Balch, Manuela Veloso

TL;DR

The paper tackles binary sequence classification under severe data imbalance and variable-length sequences by introducing HMM-e, an ensemble of Hidden Markov Models trained on random data subsets. It defines a model-agnostic composite score $s(\mathcal{O})=\sum_{i=1}^N\sum_{j=1}^M \mathbb{1}\{p(\mathcal{O}\mid\lambda_i^+) > p(\mathcal{O}\mid\lambda_j^-)\}$ to compare sequences without direct length normalization and demonstrates that HMM-e improves robustness and performance, even against CNNs and LSTMs on genomics benchmarks. The framework also supports downstream modeling by using HMM-e likelihoods as features for SVMs and NNs, with normalization to address sequence length effects, and it shows that HMM-e generally outperforms baselines, particularly in highly imbalanced data scenarios. Overall, HMM-e offers a practical, scalable approach to sequence classification with strong performance, interpretability, and compatibility with various downstream methods, with future work including ensemble pruning and synthetic data generation.
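The composite score can be illustrated with a minimal sketch. It assumes the per-model log-likelihoods of one sequence have already been computed; the function name `composite_score` and the example values are hypothetical, not taken from the paper. Since the logarithm is monotonic, comparing log-likelihoods gives the same indicator values as comparing likelihoods.

```python
import numpy as np


def composite_score(loglik_pos, loglik_neg):
    """Count pairwise wins of positive-class models over negative-class models.

    loglik_pos: N values, log p(O | lambda_i^+) for each positive-class model.
    loglik_neg: M values, log p(O | lambda_j^-) for each negative-class model.
    Returns an integer between 0 and N * M.
    """
    loglik_pos = np.asarray(loglik_pos, dtype=float)
    loglik_neg = np.asarray(loglik_neg, dtype=float)
    # Broadcasting builds the N x M matrix of indicator values.
    wins = loglik_pos[:, None] > loglik_neg[None, :]
    return int(wins.sum())


# Hypothetical log-likelihoods for a single sequence.
score = composite_score([-10.0, -12.0, -15.0], [-11.0, -14.0])
```

Every comparison is between two likelihoods of the same sequence, so the count needs no explicit length normalization, which is consistent with the description above.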

Abstract

We present a lightweight approach to sequence classification using Ensemble Methods for Hidden Markov Models (HMMs). HMMs offer significant advantages in scenarios with imbalanced or smaller datasets due to their simplicity, interpretability, and efficiency. These models are particularly effective in domains such as finance and biology, where traditional methods struggle with high feature dimensionality and varied sequence lengths. Our ensemble-based scoring method enables the comparison of sequences of any length and improves performance on imbalanced datasets. This study focuses on the binary classification problem, particularly in scenarios with data imbalance, where the negative class is the majority (e.g., normal data) and the positive class is the minority (e.g., anomalous data), often with extreme distribution skews. We propose a novel training approach for HMM Ensembles that generalizes to multi-class problems and supports classification and anomaly detection. Our method fits class-specific groups of diverse models using random data subsets, and compares likelihoods across classes to produce composite scores, achieving high average precisions and AUCs. In addition, we compare our approach with neural network-based methods such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), highlighting the efficiency and robustness of HMMs in data-scarce environments. Motivated by real-world use cases, our method demonstrates robust performance across various benchmarks, offering a flexible framework for diverse applications.
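The training idea (class-specific groups of models, each fitted on a random data subset) can be sketched as follows. To stay self-contained, the sketch swaps in a first-order Markov chain as the learner; this is not an HMM and not the authors' implementation. The names `MarkovChain`, `fit_ensemble`, and `likelihood_features`, the toy data, and the choice to normalize by dividing each log-likelihood by the sequence length are all assumptions made for illustration.

```python
import numpy as np


class MarkovChain:
    """Stand-in learner (NOT an HMM): first-order chain with add-one smoothing."""

    def __init__(self, n_symbols):
        self.n_symbols = n_symbols

    def fit(self, sequences):
        init = np.ones(self.n_symbols)
        trans = np.ones((self.n_symbols, self.n_symbols))
        for seq in sequences:
            init[seq[0]] += 1
            for a, b in zip(seq[:-1], seq[1:]):
                trans[a, b] += 1
        self.log_init = np.log(init / init.sum())
        self.log_trans = np.log(trans / trans.sum(axis=1, keepdims=True))
        return self

    def score(self, seq):
        """Log-likelihood of one sequence under the fitted chain."""
        total = self.log_init[seq[0]]
        for a, b in zip(seq[:-1], seq[1:]):
            total += self.log_trans[a, b]
        return float(total)


def fit_ensemble(sequences, n_models, subset_size, n_symbols, rng):
    """Fit n_models learners, each on a random subset drawn without replacement."""
    k = min(subset_size, len(sequences))
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(sequences), size=k, replace=False)
        models.append(MarkovChain(n_symbols).fit([sequences[i] for i in idx]))
    return models


def likelihood_features(seq, pos_models, neg_models):
    """Per-symbol log-likelihoods from all models, as a vector of length N + M.

    Dividing by len(seq) is an assumed normalization, not the paper's stated one.
    """
    scores = [m.score(seq) for m in pos_models + neg_models]
    return np.array(scores) / len(seq)


rng = np.random.default_rng(0)
# Toy data (hypothetical): positives tend to repeat a symbol, negatives alternate.
pos_train = [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0], [0, 0, 1, 1, 1, 1, 1]] * 4
neg_train = [[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0, 1, 0, 1]] * 4
pos_models = fit_ensemble(pos_train, n_models=5, subset_size=6, n_symbols=2, rng=rng)
neg_models = fit_ensemble(neg_train, n_models=5, subset_size=6, n_symbols=2, rng=rng)

features = likelihood_features([0, 0, 0, 0, 1, 1, 1, 1], pos_models, neg_models)
```

The resulting vector could be passed to a downstream classifier such as an SVM; any learner that exposes `fit` and a per-sequence log-likelihood could replace the stand-in.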

Paper Structure (44 sections, 6 equations, 5 figures, 2 tables, 6 algorithms)

Figures (5)

  • Figure 1: Flow diagram of our HMM-e ensemble training and inference approach, as detailed in the section on HMM ensembles. While we adopt this approach using HMMs, the framework itself is model agnostic. The training data is broken into random subsets, and a diverse ensemble of learners is trained on these subsets. At inference time, the likelihood from each positive-class model is compared pairwise against the likelihood from each negative-class model, giving the composite score $s$.
  • Figure 2: The distribution of composite scores for test data in demo_coding_vs_intergenomic_seqs, using a 250-model ensemble in an imbalanced data setting with class ratio 50:1. We observe good class separation even with data imbalance.
  • Figure 3: Pairwise inter-HMM similarities for models in the 250-model ensemble on dataset demo_human_or_worm. In the properly configured setting (left) models are trained from unique initializations for 25 iterations. In the degenerate setting (right) subsets of models are trained from identical initializations for 5 iterations. We see that without sufficient diversity the ensemble may learn redundant models, effectively reducing the ensemble size.
  • Figure 4: UMAP embeddings of training data from dataset demo_human_or_worm, using the likelihood vectors from HMM-e containing 250 sub-models for each class.
  • Figure 5: Impact of ensemble size and class imbalance on classifier performance. An increase in class imbalance causes significant decay in AP, as expected, along with a misleading increase in AUC-ROC. Larger ensemble sizes tend to perform better than smaller ones.
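The different sensitivity of AP and AUC-ROC to class imbalance noted for Figure 5 can be illustrated on synthetic scores. This sketch does not reproduce the paper's experiment or its numbers: the Gaussian score distributions, the sample sizes, and the helper names `average_precision` and `auc_roc` are assumptions. It keeps the score distributions fixed and changes only the class ratio.

```python
import numpy as np


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive example."""
    order = np.argsort(-scores)
    y = y_true[order]
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float(precision[y == 1].mean())


def auc_roc(y_true, scores):
    """AUC via the rank-sum statistic (assumes no tied scores)."""
    ranks = np.argsort(np.argsort(scores)) + 1
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    rank_sum = ranks[y_true == 1].sum()
    return float((rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


def evaluate(n_pos, n_neg, rng):
    # Synthetic scores: positives centred at 1.5, negatives at 0 (assumed).
    scores = np.concatenate([rng.normal(1.5, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])
    y_true = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    return average_precision(y_true, scores), auc_roc(y_true, scores)


rng = np.random.default_rng(0)
ap_balanced, auc_balanced = evaluate(200, 200, rng)        # class ratio 1:1
ap_imbalanced, auc_imbalanced = evaluate(200, 10000, rng)  # class ratio 50:1
print(f"1:1   AP={ap_balanced:.3f}  AUC={auc_balanced:.3f}")
print(f"50:1  AP={ap_imbalanced:.3f}  AUC={auc_imbalanced:.3f}")
```

In this setup AP falls sharply as negatives are added while AUC-ROC stays roughly unchanged, which is one reason AUC-ROC alone can look reassuring on heavily imbalanced data.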