On the Generalizability of Machine Learning-based Ransomware Detection in Block Storage

Nicolas Reategui; Roman Pletka; Dionysios Diamantopoulos

TL;DR

The paper studies how well ML-based ransomware detection in block storage generalizes. It proposes a lightweight, kernel-based IO feature extraction pipeline that produces 82 features and feeds an XGBoost classifier for real-time detection. The generalizability study covers volume states, file systems, copy-on-write VM images, device encryption, and real-world deployment, and shows robust performance with low false negative and false positive rates when the models are trained over diverse configurations. Key contributions include a hardware-friendly feature set, in-line inference architectures (including CSD-enabled storage controllers), and extensive empirical validation showing better generalization than prior storage-based approaches. The findings support the practical viability of storage-level ransomware detection in modern data centers. They also identify areas where encryption and VM-related IO patterns require augmented training data, and potentially alternative features, to maintain high F1 scores in production. Overall, the work advances storage security with a scalable, low-overhead, generalizable ransomware detector that can operate inside storage systems or VMs, with implications for secure StaaS deployments and for future research on encrypted and highly dynamic IO patterns.
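
As a rough illustration of what computationally light IO features can look like, the sketch below derives a few counters from the block requests seen in one observation window. It is a minimal sketch under stated assumptions: the `IoRecord` fields, the `window_features` function, and the five features are hypothetical and are not the paper's 82 features.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IoRecord:
    """One block-layer request (hypothetical fields, for illustration only)."""
    is_write: bool
    lba: int         # starting logical block address
    num_blocks: int  # request length in blocks


def window_features(records: Iterable[IoRecord]) -> dict[str, float]:
    """Compute a few cheap counters over the requests of one window."""
    reads = writes = write_blocks = 0
    read_lbas: set[int] = set()
    writes_after_read = 0  # writes that hit an address read earlier in the window

    for r in records:
        if r.is_write:
            writes += 1
            write_blocks += r.num_blocks
            if r.lba in read_lbas:
                writes_after_read += 1
        else:
            reads += 1
            read_lbas.add(r.lba)

    total = reads + writes
    return {
        "read_count": float(reads),
        "write_count": float(writes),
        "write_fraction": writes / total if total else 0.0,
        "mean_write_blocks": write_blocks / writes if writes else 0.0,
        "read_then_write_fraction": writes_after_read / writes if writes else 0.0,
    }


if __name__ == "__main__":
    window = [
        IoRecord(is_write=False, lba=100, num_blocks=8),
        IoRecord(is_write=True, lba=100, num_blocks=8),
        IoRecord(is_write=True, lba=200, num_blocks=16),
        IoRecord(is_write=False, lba=300, num_blocks=8),
    ]
    print(window_features(window))
```

Each feature here needs only a single pass and a handful of integer counters per window, which is the kind of cost profile a resource-constrained storage controller could plausibly afford. A feature vector of this form would then be passed to a trained classifier.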

Abstract

Ransomware represents a pervasive threat, traditionally countered at the operating system, file-system, or network levels. However, these approaches often introduce significant overhead and remain susceptible to circumvention by attackers. Recent research has started to look into detecting ransomware by observing block IO operations, but this approach faces significant detection challenges. Recognizing these limitations, our research pivots towards enabling robust ransomware detection in storage systems while keeping their limited computational resources in mind. To perform our studies, we propose a kernel-based framework capable of efficiently extracting and analyzing IO operations to identify ransomware activity. The framework can be adapted to storage systems that use computational storage devices to improve security and fully hide detection overheads. Our method employs a refined set of computationally light features optimized for ML models to accurately distinguish malicious from benign activity. Using this lightweight approach, we study a wide range of generalizability aspects and analyze the performance of these models across a large space of setups and configurations covering realistic real-world scenarios. We reveal various trade-offs, provide strong arguments for the generalizability of storage-based ransomware detection, and show that our approach outperforms currently available ML-based ransomware detection in storage. Empirical validation shows that our decision tree-based models achieve median F1 scores up to 12.8% higher, false negative rates up to 10.9% lower, and false positive rates up to 17.1% lower than existing storage-based detection approaches.
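
For reference, the three metrics reported above follow the standard confusion-matrix definitions, with ransomware as the positive class. The sketch below computes them from raw counts. The function name `detection_metrics` and the counts in the usage example are illustrative assumptions and are not results from the paper.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Return F1, false negative rate, and false positive rate.

    tp/fn count ransomware samples (detected/missed);
    fp/tn count benign samples (flagged/passed).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # missed ransomware
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # benign flagged as ransomware
    return {"f1": f1, "fnr": fnr, "fpr": fpr}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(detection_metrics(tp=90, fp=5, tn=95, fn=10))
```

A low FNR means few ransomware samples go undetected, while a low FPR means few benign workloads raise alarms. F1 combines precision and recall into a single score.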

Paper Structure (21 sections, 12 figures, 4 tables)

This paper contains 21 sections, 12 figures, 4 tables.

Introduction
Methodology
Challenges
Architectural overview
Security Aspects
Feature Engineering
Experimental setup
Benign and ransomware workloads
Basic generalizability aspects
Volume state generalizability
File-system generalizability
Benign workloads generalizability
Copy-on-write effects from VM images
Device encryption
LUKS encryption
...and 6 more sections

Figures (12)

Figure 1: (a) Linux-based ransomware detection in user space. (b) Hypervisor-based detection of ransomware on guest OS. (c) Storage system with integrated detection capabilities using CSD SSDs.
Figure 2: Results for different volume states (high and medium utilization, long-term effects, and combining all). F1 scores are evaluated across all hold-out test sets using traces from an XFS file system and training XGBoost models.
Figure 3: Generalizability across file system types. F1 scores of XGBoost models trained and evaluated on XFS, EXT4, or NTFS.
Figure 4: FPR of benign workloads trained with XFS on either baseline, MySQL, PGSQL, or combined traces using our features.
Figure 5: FPR of benign workloads trained with XFS on either baseline, MySQL, PGSQL, or combined traces using features from hirano2019machine.
...and 7 more figures
