Hydra: Robust Hardware-Assisted Malware Detection

Eli Propp, Seyed Majid Zahedi

TL;DR

Hydra addresses a robustness gap in hardware-assisted malware detection: detectors that monitor a small, fixed set of hardware events are left with blind spots. Hydra instead schedules multiple complementary feature sets across time slices. It introduces a formal sequence-learning framework that aggregates per-slice detector outputs in log-odds space and optimizes a simplex-constrained mixture of sequences offline, then deploys the learned schedule online using reconfigurable Hardware Performance Counters (HPCs). Empirically, Hydra achieves a 19.32% improvement in F1 score and a 60.23% reduction in false positive rate (FPR) over the best single-feature-set baselines, while maintaining an accuracy of 0.971 and an FPR of 0.031. These results indicate that broadening feature-set coverage over time can mitigate the limitations of monitoring a fixed set of hardware events, which matters for robust, low-overhead malware detection.
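
As a rough illustration of aggregating per-slice outputs in log-odds space, the following Python sketch converts each slice's malware probability to a logit, averages the logits, and maps the mean back to a probability. The function names and the clipping constant `eps` are assumptions made for this example and are not taken from the paper; the offline optimization of the simplex-constrained mixture is not shown.

```python
import math


def logit(p, eps=1e-6):
    """Map a probability to log-odds, clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)  # eps is an assumed clipping constant
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def aggregate_mean_logit(slice_probs):
    """Combine per-slice malware probabilities by averaging their logits.

    slice_probs: one probability per time slice, each produced by the
    classifier attached to the feature set monitored in that slice
    (hypothetical input layout).
    """
    if not slice_probs:
        raise ValueError("at least one slice probability is required")
    mean_logit = sum(logit(p) for p in slice_probs) / len(slice_probs)
    return sigmoid(mean_logit)


if __name__ == "__main__":
    # Three slices, each scored by a different (hypothetical) detector.
    print(aggregate_mean_logit([0.92, 0.40, 0.75]))
```

Compared with a plain average of probabilities, averaging in log-odds space lets a highly confident slice pull the combined score further toward its own verdict, while two equally confident slices that disagree cancel out to 0.5.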

Abstract

Malware detection using Hardware Performance Counters (HPCs) offers a promising, low-overhead approach for monitoring program behavior. However, a fundamental architectural constraint, that only a limited number of hardware events can be monitored concurrently, creates a significant bottleneck, leading to detection blind spots. Prior work has primarily focused on optimizing machine learning models for a single, statically chosen event set, or on ensembling models over the same feature set. We argue that robustness requires diversifying not only the models, but also the underlying feature sets (i.e., the monitored hardware events) in order to capture a broader spectrum of program behavior. This observation motivates the following research question: Can detection performance be improved by trading temporal granularity for broader coverage, via the strategic scheduling of different feature sets over time? To answer this question, we propose Hydra, a novel detection mechanism that partitions execution traces into time slices and learns an effective schedule of feature sets and corresponding classifiers for deployment. By cycling through complementary feature sets, Hydra mitigates the limitations of a fixed monitoring perspective. Our experimental evaluation shows that Hydra significantly outperforms state-of-the-art single-feature-set baselines, achieving a 19.32% improvement in F1 score and a 60.23% reduction in false positive rate. These results underscore the importance of feature-set diversity and establish strategic multi-feature-set scheduling as an effective principle for robust, hardware-assisted malware detection.

Paper Structure (28 sections, 8 equations, 6 figures, 8 tables, 1 algorithm)

This paper contains 28 sections, 8 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Performance of individual models. Each plot shows the performance distribution of a given model family (e.g., DT, RF) for a specific metric. Points are color-coded by feature set across all plots. The box denotes the interquartile range (middle 50%), while the whiskers indicate the non-outlier range. For the rightmost plot (FPR), lower values indicate better performance.
  • Figure 2: Percentage of caught mistakes. The heatmap illustrates mistake coverage across ensemble baselines. For each row–column pair $(i, j)$, the value indicates the percentage of mistakes made by baseline $i$ that are correctly classified by baseline $j$. The diagonal entries are zero, since a baseline cannot, by definition, correct its own errors.
  • Figure 3: Percent-improvement boxplots of Hydra using individual baselines relative to all baselines. Values above zero indicate improved performance. For FPR (far right), positive values indicate a reduction in false positive rate achieved by Hydra relative to the baselines.
  • Figure 4: Performance comparison of Hydra using the logistic and MSE objective functions. Both configurations employ mean logit probability aggregation and no regularization.
  • Figure 5: Performance of Hydra when using 20% (blue) and 10% of the available training data for sequence training.
  • ...and 1 more figure
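
The mistake-coverage statistic described for Figure 2 can be sketched as follows. This is a minimal Python illustration under assumed inputs (a list of ground-truth labels and one prediction list per baseline); the function name and data layout are hypothetical and not taken from the paper. Returning zero for a baseline that makes no mistakes is likewise a choice made here.

```python
def caught_mistake_matrix(y_true, predictions):
    """Return M where M[i][j] is the percentage of baseline i's mistakes
    that baseline j classifies correctly. Diagonal entries stay at zero.

    y_true: list of ground-truth labels.
    predictions: list of per-baseline prediction lists, each aligned
    with y_true (hypothetical layout).
    """
    n = len(predictions)
    matrix = [[0.0] * n for _ in range(n)]
    for i, pred_i in enumerate(predictions):
        # Indices of the samples that baseline i gets wrong.
        mistakes = [k for k, (t, p) in enumerate(zip(y_true, pred_i)) if t != p]
        if not mistakes:
            continue  # nothing to catch; the row stays at zero (assumption)
        for j, pred_j in enumerate(predictions):
            if i == j:
                continue  # a baseline cannot correct its own errors
            caught = sum(1 for k in mistakes if pred_j[k] == y_true[k])
            matrix[i][j] = 100.0 * caught / len(mistakes)
    return matrix


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    baseline_a = [1, 0, 0, 1]  # wrong on samples 2 and 3
    baseline_b = [0, 0, 1, 1]  # wrong on samples 0 and 3
    for row in caught_mistake_matrix(labels, [baseline_a, baseline_b]):
        print(row)
```

High off-diagonal values mean two baselines tend to fail on different samples, which is the kind of complementarity that scheduling multiple feature sets relies on.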