On the Generalizability of Machine Learning-based Ransomware Detection in Block Storage
Nicolas Reategui, Roman Pletka, Dionysios Diamantopoulos
TL;DR
The paper tackles the generalizability of ML-based ransomware detection in block storage by proposing a kernel-based, lightweight IO-feature extraction pipeline that produces 82 features and uses an XGBoost classifier for real-time detection. It conducts a thorough generalizability study across volume states, file systems, copy-on-write VM images, device encryption, and real-world deployment, demonstrating robust performance with low false negative and false positive rates when trained over diverse configurations. Key contributions include a hardware-friendly feature set, in-line inference architectures (including CSD-enabled storage controllers), and extensive empirical validation showing improved generalization over prior storage-based approaches. The findings highlight practical viability for storage-level ransomware detection in modern data centers, while also identifying areas where encryption and VM-related IO patterns require augmented training data and potentially alternative features to maintain high F1 scores in production. Overall, the work advances storage security by delivering a scalable, low-overhead, generalizable ransomware detector that can operate inside storage systems or VMs, with clear implications for secure StaaS deployments and future research directions in handling encrypted and highly dynamic IO patterns.
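To make the idea of lightweight IO features concrete, the following is a minimal sketch of computing a few cheap per-window statistics from a stream of block requests. The `IoRequest` record, the `window_features` function, and the three features shown are hypothetical illustrations chosen for this sketch; they are not the paper's 82 features, and the classifier step is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class IoRequest:
    """Hypothetical block IO record (not the paper's trace format)."""
    lba: int        # starting logical block address
    blocks: int     # request length in blocks
    is_write: bool  # True for a write, False for a read


def window_features(window: List[IoRequest]) -> Dict[str, float]:
    """Compute three illustrative features over one window of requests.

    These features are assumptions made for this sketch only.
    """
    n = len(window)
    if n == 0:
        return {
            "write_fraction": 0.0,
            "mean_blocks": 0.0,
            "read_then_overwrite_fraction": 0.0,
        }

    read_lbas = set()   # start addresses read earlier in this window
    writes = 0
    overwrites = 0      # writes that hit an address read earlier
    for req in window:
        if req.is_write:
            writes += 1
            if req.lba in read_lbas:
                overwrites += 1
        else:
            read_lbas.add(req.lba)

    return {
        "write_fraction": writes / n,
        "mean_blocks": sum(r.blocks for r in window) / n,
        "read_then_overwrite_fraction": overwrites / writes if writes else 0.0,
    }


example_window = [
    IoRequest(lba=0, blocks=8, is_write=False),
    IoRequest(lba=0, blocks=8, is_write=True),     # overwrites a block just read
    IoRequest(lba=100, blocks=16, is_write=True),
    IoRequest(lba=200, blocks=8, is_write=False),
]
features = window_features(example_window)
```

Each feature needs only counters and a set lookup per request, which is the kind of cost profile a resource-constrained storage controller could plausibly afford. A feature vector like this would then be passed to a trained tree-based classifier.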
Abstract
Ransomware is a pervasive threat, traditionally countered at the operating system, file system, or network level. However, these approaches often introduce significant overhead and remain susceptible to circumvention by attackers. Recent research has begun to explore detecting ransomware by observing block IO operations, but this approach faces significant detection challenges. Recognizing these limitations, our research pivots towards enabling robust ransomware detection in storage systems, keeping in mind their limited computational resources. To perform our studies, we propose a kernel-based framework capable of efficiently extracting and analyzing IO operations to identify ransomware activity. The framework can be adopted in storage systems that use computational storage devices to improve security and fully hide detection overheads. Our method employs a refined set of computationally light features optimized for ML models to accurately distinguish malicious from benign activity. Using this lightweight approach, we study a wide range of generalizability aspects and analyze the performance of these models across a large space of setups and configurations covering many realistic scenarios. We reveal various trade-offs, provide strong arguments for the generalizability of storage-based ransomware detection, and show that our approach outperforms currently available ML-based ransomware detection in storage. Empirical validation shows that our decision tree-based models are highly effective, with median F1 scores up to 12.8% higher, false negative rates up to 10.9% lower, and false positive rates up to 17.1% lower than existing storage-based detection approaches.
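For reference, the three metrics reported above follow the standard definitions over confusion-matrix counts. The sketch below assumes ransomware is the positive class. The function name `detection_metrics` and the example counts are illustrative and do not reproduce any result from the paper.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard F1, false negative rate, and false positive rate.

    tp/fn count ransomware samples detected/missed (positive class, assumed);
    fp/tn count benign samples flagged/passed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # missed ransomware
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # benign flagged as ransomware
    return {"f1": f1, "fnr": fnr, "fpr": fpr}


# Illustrative counts only, not data from the paper.
metrics = detection_metrics(tp=90, fp=10, tn=190, fn=10)
```

In a storage setting the false positive rate matters as much as the F1 score, since every false alarm may trigger a costly response such as a snapshot or an IO throttle on a healthy volume.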
