ClarAVy: A Tool for Scalable and Accurate Malware Family Labeling

Robert J. Joyce, Derek Everett, Maya Fuchs, Edward Raff, James Holt

TL;DR

ClarAVy tackles the problem of automatically labeling malware families by addressing three bottlenecks: AV detection parsing, family alias resolution, and aggregation. It introduces a scalable Variational Bayesian framework (SparseIBCC) that aggregates heterogeneous AV detections while accounting for correlations between products and aliasing, enabling label predictions over tens of millions of samples and thousands of families. Empirically, ClarAVy outperforms established tools such as AVClass, AVClass++, EUPHONY, Sumav, and TagClass, with 8 and 12 percentage points higher accuracy than the prior leading tool on the MOTIF and MalPedia datasets, respectively. It also provides a confidence score for each labeling decision, so low-confidence labels can be filtered out when high-fidelity labels are needed. The approach scales to large real-world corpora and supports downstream tasks such as training machine learning classifiers, with plans to deploy at substantial daily throughput.

Abstract

Determining the family to which a malicious file belongs is an essential component of cyberattack investigation, attribution, and remediation. Performing this task manually is time-consuming and requires expert knowledge. Automated tools that label malware using antivirus detections lack accuracy and/or scalability, making them insufficient for real-world applications. Three pervasive shortcomings in these tools are responsible: (1) incorrect parsing of antivirus detections, (2) errors during family alias resolution, and (3) an inappropriate antivirus aggregation strategy. To address each of these, we created our own malware family labeling tool called ClarAVy. ClarAVy utilizes a Variational Bayesian approach to aggregate detections from a collection of antivirus products into accurate family labels. Our tool scales to enormous malware datasets, and we evaluated it by labeling $\approx$40 million malicious files. ClarAVy has 8 and 12 percentage points higher accuracy than the prior leading tool in labeling the MOTIF and MalPedia datasets, respectively.

Paper Structure

This paper contains 28 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Fictitious AV scan report with detections from 9 AV products. AV products 1-6 correctly detect that this file belongs to the Andromeda family, which has the aliases Androm, Gamarue, and Wauchos. AV 7 predicts that the malware is a trojan and a backdoor. AV 8 uses an ML-based approach to detect the file as malicious. The heuristic used by AV 9 results in an incorrect detection as the Zbot family.
  • Figure 2: Parsing the detection Exploit:Win32/MS08067.xyz. ClarAVy tokenizes it and selects a parsing function based on the structure TOK:TOK/TOK.TOK. A regular expression in the parsing function finds that the third token is the MS08-067 vulnerability, rather than a family name.
  • Figure 3: Known relationships between AV products [joyce2023maldict]
  • Figure 4: Plate diagram for VB-IBCC [IBCC]
  • Figure 5: Accuracy of ClarAVy if scans below the given confidence are ignored. No scans had a confidence of 92% or higher. ClarAVy has more than 90% accuracy on MOTIF when using a confidence threshold of 70%.
  • ...and 2 more figures
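
The parsing step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not ClarAVy's actual implementation: the delimiter set, the function names (`tokenize`, `parse_detection`, `parse_category_platform_name_suffix`), the field names, and the vulnerability regular expression are all assumptions made for illustration. It shows only the general idea of tokenizing a detection, dispatching on its token structure, and checking whether the name token is a vulnerability identifier rather than a family name.

```python
import re

# Assumed delimiter set for illustration; real AV detections use many more.
DELIMS = re.compile(r"([:/.])")

# Assumed pattern for Microsoft security bulletin IDs such as MS08067 or MS08-067.
VULN_RE = re.compile(r"^(MS)(\d{2})-?(\d{3})$", re.IGNORECASE)


def tokenize(detection):
    """Split a detection string into tokens and describe its structure.

    Returns (tokens, structure), where structure replaces every token
    with the placeholder TOK and keeps the delimiters in place.
    """
    parts = DELIMS.split(detection)   # tokens and delimiters alternate
    tokens = parts[0::2]
    delims = parts[1::2]
    structure = "TOK" + "".join(d + "TOK" for d in delims)
    return tokens, structure


def parse_category_platform_name_suffix(tokens):
    """Hypothetical parsing function for the structure TOK:TOK/TOK.TOK."""
    category, platform, name, _suffix = tokens
    result = {"category": category, "platform": platform,
              "family": None, "vulnerability": None}
    match = VULN_RE.match(name)
    if match:
        # The third token names a vulnerability, not a malware family.
        prefix, year, number = match.groups()
        result["vulnerability"] = f"{prefix.upper()}{year}-{number}"
    else:
        result["family"] = name
    return result


# Dispatch table from token structure to parsing function (one entry here).
PARSERS = {"TOK:TOK/TOK.TOK": parse_category_platform_name_suffix}


def parse_detection(detection):
    """Tokenize a detection and apply the parser matching its structure."""
    tokens, structure = tokenize(detection)
    parser = PARSERS.get(structure)
    if parser is None:
        return None  # no parsing function registered for this structure
    return parser(tokens)


if __name__ == "__main__":
    print(parse_detection("Exploit:Win32/MS08067.xyz"))
    print(parse_detection("Trojan:Win32/Zbot.abc"))
```

Dispatching on the token structure keeps each parsing function small: a function only has to handle detections with one known layout, so a rule such as "the third token may be a vulnerability ID" does not leak into the handling of other layouts.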