MASCOT: Analyzing Malware Evolution Through A Well-Curated Source Code Dataset
Bojing Li, Duo Zhong, Dharani Nadendla, Gabriel Terceros, Prajna Bhandar, Raguvir S, Charles Nicholas
TL;DR
The paper introduces MASCOT, a large, manually reviewed malware source-code dataset, and a dual-analysis framework that combines software engineering metrics with malware genealogy to study malware evolution. It provides eight labels per specimen, timestamps, and a visualization toolkit for analyzing evolution from both macro and fine-grained perspectives. Key findings show increasing standardization and complexity in malware development, with modern samples being smaller yet of higher quality, and show that code reuse propagates vulnerabilities across lineages. The dataset and tools enable reproducible malware-evolution research and offer practical insights for defenders through fine-grained analyses of code reuse and vulnerability inheritance.
Abstract
In recent years, the explosion of malware and extensive code reuse have formed complex evolutionary connections among malware specimens. The rapid pace of development makes it challenging for existing studies to characterize recent evolutionary trends. In addition, intuitive tools to untangle these intricate connections between malware specimens or categories are urgently needed. This paper introduces MASCOT, a manually reviewed malware source code dataset containing 6032 specimens. Building on and extending current research from a software engineering perspective, we systematically evaluate the scale, development cost, code quality, security, and dependencies of modern malware. We further introduce a multi-view genealogy analysis to clarify malware connections: at the overall level, this analysis quantifies the strength and direction of connections among specimens and categories; at the detailed level, it traces the evolutionary histories of individual specimens. Experimental results indicate that, despite persistent shortcomings in code quality, malware specimens exhibit increasing complexity and standardization, in step with the development of mainstream software engineering practices. Meanwhile, our genealogy analysis intuitively reveals lineage expansion and evolution driven by code reuse, providing new evidence and tools for understanding the formation and evolution of the malware ecosystem.
