MLRan: A Behavioural Dataset for Ransomware Analysis and Detection

Faithful Chiagoziem Onwuegbuche, Adelodun Olaoluwa, Anca Delia Jurcut, Liliana Pasquale

TL;DR

MLRan tackles the scarcity and narrow scope of public ransomware datasets by delivering what the authors present as the largest open behavioural dataset: 4,880 samples in total, of which 2,330 are ransomware from 64 families and 2,550 are goodware, described by nine feature groups. It introduces the GUIDE-MLRan guidelines to standardise reproducible dataset construction and demonstrates a two-stage feature selection (mutual information filtering followed by recursive feature elimination) that reduces 6.4 million features to 483 without sacrificing accuracy, reaching up to 98.7% accuracy, 98.9% precision, and 98.5% recall with efficient computation. SHAP and LIME analyses reveal API usage, strings, registry, and memory-related behaviours as key ransomware indicators, offering actionable interpretability. The authors provide an open-source, end-to-end pipeline (dynamic analysis via Cuckoo Sandbox, feature extraction, selection, ML training, and evaluation) to support replicability and future research in ransomware detection.

Abstract

Ransomware remains a critical threat to cybersecurity, yet publicly available datasets for training machine learning-based ransomware detection models are scarce and often limited in sample size, diversity, and reproducibility. In this paper, we introduce MLRan, a behavioural ransomware dataset comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We also propose guidelines (GUIDE-MLRan), inspired by previous work, for constructing high-quality behavioural ransomware datasets, which informed the curation of our dataset. We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan. For this purpose, we performed feature selection in two stages: mutual information filtering reduced the initial 6.4 million features to 24,162, and recursive feature elimination then yielded 483 highly informative features. The ML models achieved accuracy, precision, and recall of up to 98.7%, 98.9%, and 98.5%, respectively. Using SHAP and LIME, we identified critical indicators of malicious behaviour, including registry tampering, strings, and API misuse. The dataset and the source code for feature extraction, selection, ML training, and evaluation are publicly available at https://github.com/faithfulco/mlran to support replicability and encourage future research.
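
The two-stage feature selection described in the abstract (mutual information filtering, then recursive feature elimination) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' pipeline: the toy data, the binary encoding of features, the function names, and the use of a small logistic regression as the ranking model inside the elimination loop are all assumptions made for illustration, since the abstract does not name the estimator. The toy run reduces 5 columns to 3 and then to 2, mirroring in miniature the reported reduction from 6.4 million features to 24,162 and then to 483.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[t] / n)))
        for (x, t), c in joint.items()
    )


def mi_filter(X, y, k):
    """Stage 1: keep the k columns with the highest mutual information."""
    scores = [mutual_information([row[j] for row in X], y)
              for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return sorted(ranked[:k])


def fit_logistic(X, y, cols, epochs=200, lr=0.5):
    """Full-batch gradient descent for logistic regression on chosen columns.

    Assumption: a stand-in ranking model; any estimator exposing
    per-feature weights or importances could play this role.
    """
    w = [0.0] * len(cols)
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        gw = [0.0] * len(cols)
        gb = 0.0
        for row, target in zip(X, y):
            z = b + sum(wi * row[c] for wi, c in zip(w, cols))
            err = 1.0 / (1.0 + math.exp(-z)) - target
            gb += err
            for i, c in enumerate(cols):
                gw[i] += err * row[c]
        b -= lr * gb / n
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
    return w


def rfe(X, y, cols, n_keep):
    """Stage 2: repeatedly refit and drop the column with the smallest |weight|."""
    cols = list(cols)
    while len(cols) > n_keep:
        w = fit_logistic(X, y, cols)
        weakest = min(range(len(cols)), key=lambda i: abs(w[i]))
        cols.pop(weakest)
    return cols


# Hypothetical toy data: 40 samples, 5 binary behaviour indicators.
y = [i % 2 for i in range(40)]
X = [
    [
        y[i],                             # col 0: identical to the label
        y[i] ^ (1 if i % 5 == 0 else 0),  # col 1: label with 20% of values flipped
        0,                                # col 2: constant, carries no information
        (i // 2) % 2,                     # col 3: statistically independent of the label
        1 - y[i],                         # col 4: inverse of the label
    ]
    for i in range(40)
]

stage1 = mi_filter(X, y, k=3)         # drops the two zero-information columns
stage2 = rfe(X, y, stage1, n_keep=2)  # prunes one more using model weights
print("after MI filter:", stage1)
print("after RFE:", stage2)
```

The filter stage is cheap because it scores each column independently, which is what makes it practical as a first pass over millions of features; the elimination stage is costlier because it refits a model at every round, so it is applied only to the columns that survive the filter. At realistic scale a library implementation, such as scikit-learn's `mutual_info_classif` and `RFE`, would normally replace the hand-written functions above.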

Paper Structure

This paper contains 69 sections, 13 equations, 16 figures, 9 tables, 2 algorithms.

Figures (16)

  • Figure 1: Distribution of software sample types in the MLRan dataset. The dataset contains a total of 4880 samples, split into 2550 (52.25%) Goodware and 2330 (47.75%) Ransomware. The dataset is relatively balanced, with only a slight difference between the two categories.
  • Figure 2: Distribution of ransomware types in the MLRan dataset. The dataset contains a total of 2330 Ransomware samples, split into 1140 (48.92%) Crypto, 468 (20.08%) RaaS, 449 (19.27%) Modern, and 273 (11.72%) Locker.
  • Figure 3: Distribution of ransomware families, colour-coded by their respective ransomware types. The numbers on the bars represent the number of samples from each ransomware family found in the MLRan Dataset. The dataset includes a total of 64 ransomware families, classified into four categories, as shown in the legend and colour-coded in the bars: 32 families belong to the Crypto type, 15 are Modern, 13 are RaaS, and 4 are Locker.
  • Figure 4: Distribution of goodware sample categories in the MLRan Dataset. The goodware samples contain 11 categories, with the Most Popular category having the highest sample count of 1212 samples, representing 47.53% of the total. The Productivity category follows with 187 samples, accounting for 7.33%. The exact number of samples for each category is displayed on the corresponding bars.
  • Figure 5: Stacked bar chart of the distribution of samples in the MLRan dataset across years, based on their first submission to VirusTotal and segmented by sample type (Goodware and Ransomware). The first submission timestamp from VirusTotal was used because it provides more reliable temporal information than creation dates, which are often manipulated by malware authors. Between 2006 and 2011, the dataset consisted predominantly of Goodware samples. From 2012 onwards, Ransomware samples increased steadily while Goodware samples remained relatively stable. Ransomware samples spiked notably in 2020 and particularly in 2021, likely driven by the surge in cybercrime during the COVID-19 pandemic.
  • ...and 11 more figures