ClarAVy: A Tool for Scalable and Accurate Malware Family Labeling
Robert J. Joyce, Derek Everett, Maya Fuchs, Edward Raff, James Holt
TL;DR
ClarAVy tackles the problem of automatically labeling malware families by addressing three bottlenecks in prior tools: parsing of AV detections, family alias resolution, and aggregation of detections across AV products. It introduces a scalable Variational Bayesian framework (SparseIBCC) that aggregates heterogeneous AV detections while accounting for correlations between products and aliasing, enabling label predictions over tens of millions of samples and thousands of families. Empirically, ClarAVy outperforms established tools such as AVClass, AVClass++, EUPHONY, Sumav, and TagClass, with 8 and 12 percentage points higher accuracy than the prior leading tool on the MOTIF and MalPedia datasets, respectively, and it provides a confidence score per labeling decision to facilitate high-fidelity labeling. The approach demonstrates practical scalability and industry relevance, labeling large real-world corpora and informing downstream tasks like machine learning classifier training, with plans to deploy at substantial daily throughput.
Abstract
Determining the family to which a malicious file belongs is an essential component of cyberattack investigation, attribution, and remediation. Performing this task manually is time-consuming and requires expert knowledge. Automated tools that label malware using antivirus detections lack accuracy and/or scalability, making them insufficient for real-world applications. Three pervasive shortcomings in these tools are responsible: (1) incorrect parsing of antivirus detections, (2) errors during family alias resolution, and (3) an inappropriate antivirus aggregation strategy. To address each of these, we created our own malware family labeling tool called ClarAVy. ClarAVy utilizes a Variational Bayesian approach to aggregate detections from a collection of antivirus products into accurate family labels. Our tool scales to enormous malware datasets, and we evaluated it by labeling $\approx$40 million malicious files. ClarAVy has 8 and 12 percentage points higher accuracy than the prior leading tool in labeling the MOTIF and MalPedia datasets, respectively.
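To make the three stages named in the abstract concrete, the following is a minimal Python sketch of a naive labeling pipeline: parse each detection into tokens, resolve aliases, then aggregate by plurality vote. It is not ClarAVy's implementation and does not use its Variational Bayesian aggregation. The alias table, the generic-token list, the function names, and the example detection strings are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical alias table mapping vendor-specific names to one canonical family.
ALIASES = {"zbot": "zeus", "wannacrypt": "wannacry", "wcry": "wannacry"}

# Hypothetical tokens that describe behavior or platform, not a family.
GENERIC = {"trojan", "win32", "w32", "generic", "malware", "ransom", "gen", "spy"}


def parse_detection(detection):
    """Stage 1: split a detection string into candidate family tokens."""
    tokens = re.split(r"[^a-z0-9]+", detection.lower())
    # Drop empty strings, short variant suffixes, pure numbers, generic words.
    return [
        t for t in tokens
        if len(t) > 2 and not t.isdigit() and t not in GENERIC
    ]


def resolve_alias(token):
    """Stage 2: map a token to its canonical family name if an alias is known."""
    return ALIASES.get(token, token)


def plurality_label(scan):
    """Stage 3: each AV product votes at most once per family; most votes wins.

    Returns (family, fraction of products that voted for it).
    """
    votes = Counter()
    for detection in scan.values():
        families = {resolve_alias(t) for t in parse_detection(detection)}
        votes.update(families)
    if not votes:
        return None, 0.0
    family, count = votes.most_common(1)[0]
    return family, count / len(scan)


# Hypothetical scan report: AV product name -> detection string.
scan = {
    "AV1": "Trojan.Win32.Zbot.a",
    "AV2": "Trojan-Spy.Zeus!gen",
    "AV3": "W32/WannaCrypt.B",
    "AV4": "Generic.Malware",
}
print(plurality_label(scan))
```

A vote like this treats every AV product as equally reliable and independent. That is the kind of aggregation strategy the abstract identifies as a shortcoming, and it is the stage ClarAVy replaces with a Variational Bayesian model.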
