
A Unified Evaluation of Learning-Based Similarity Techniques for Malware Detection

Udbhav Prasad, Aniesh Chawla

TL;DR

The paper tackles the challenge of malware similarity by comparing traditional fuzzy hashing with learning-based embeddings under a unified, reproducible framework. It systematically benchmarks autoencoder-based embeddings, deep learning classifiers, and XGBoost-based leaf-index embeddings on large static PE metadata from EMBER and EmberSim, using consistent binary, multiclass, and similarity evaluations with Euclidean distance. Key findings show that unsupervised autoencoder embeddings substantially outperform fuzzy hashes in Top-K similarity and that deep learning embeddings yield the strongest clustering quality across similarity metrics, while XGBoost excels at binary and AVClass classification but provides weaker embedding-based similarity. The results underscore the need for hybrid malware analysis pipelines that combine complementary classification and similarity techniques to robustly triage and cluster threats at scale.
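As a rough illustration of what Top-K similarity with Euclidean distance means in practice, the sketch below ranks a handful of embedding vectors by their distance to a query vector. It is not the paper's implementation: the function name `top_k_similar`, the sample identifiers, and the three-dimensional vectors are all hypothetical, and a real system would use embeddings produced by a trained model plus an indexed nearest-neighbor search rather than a linear scan.

```python
import heapq
import math


def top_k_similar(query, corpus, k=3):
    """Return the k (distance, sample_id) pairs nearest to `query`.

    `corpus` maps a sample identifier to its embedding vector.
    Distances are Euclidean; this is a brute-force linear scan.
    """
    scored = (
        (math.dist(query, vector), sample_id)
        for sample_id, vector in corpus.items()
    )
    return heapq.nsmallest(k, scored)


# Hypothetical embeddings, chosen only to make the distances easy to check.
corpus = {
    "sample_a": (0.0, 0.0, 0.0),
    "sample_b": (1.0, 0.0, 0.0),
    "sample_c": (0.0, 3.0, 4.0),
    "sample_d": (10.0, 10.0, 10.0),
}

neighbors = top_k_similar((0.0, 0.0, 0.0), corpus, k=3)
for distance, sample_id in neighbors:
    print(f"{sample_id}: {distance:.1f}")
```

A Top-K evaluation would then check how many of the returned neighbors share the query's label (for example, its malware family).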

Abstract

Cryptographic digests (e.g., MD5, SHA-256) are designed to provide exact identity. Any single-bit change in the input produces a completely different hash, which is ideal for integrity verification but limits their usefulness in many real-world tasks like threat hunting, malware analysis and digital forensics, where adversaries routinely introduce minor transformations. Similarity-based techniques address this limitation by enabling approximate matching, allowing related byte sequences to produce measurably similar fingerprints. Modern enterprises manage tens of thousands of endpoints with billions of files, making the effectiveness and scalability of the proposed techniques more important than ever in security applications. Security researchers have proposed a range of approaches, including similarity digests and locality-sensitive hashes (e.g., ssdeep, sdhash, TLSH), as well as more recent machine-learning-based methods that generate embeddings from file features. However, these techniques have largely been evaluated in isolation, using disparate datasets and evaluation criteria. This paper presents a systematic comparison of learning-based classification and similarity methods using large, publicly available datasets. We evaluate each method under a unified experimental framework with industry-accepted metrics. To our knowledge, this is the first reproducible study to benchmark these diverse learning-based similarity techniques side by side for real-world security workloads. Our results show that no single approach performs well across all dimensions; instead, each exhibits distinct trade-offs, indicating that effective malware analysis and threat-hunting platforms must combine complementary classification and similarity techniques rather than rely on a single method.
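The exact-identity property described in the abstract can be seen in a few lines. The following minimal sketch flips a single bit of an input and compares the SHA-256 digests before and after. The payload bytes and the helper name `bit_difference` are illustrative assumptions, not taken from the paper.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical payload standing in for a file's contents.
original = b"example payload standing in for a file's bytes"

# Flip the lowest bit of the first byte: a one-bit change to the input.
mutated_buffer = bytearray(original)
mutated_buffer[0] ^= 0x01
mutated = bytes(mutated_buffer)

digest_original = hashlib.sha256(original).digest()
digest_mutated = hashlib.sha256(mutated).digest()

input_bits_changed = bit_difference(original, mutated)
digest_bits_changed = bit_difference(digest_original, digest_mutated)

print(f"input bits changed:  {input_bits_changed}")
print(f"digest bits changed: {digest_bits_changed} of 256")
```

Because the two digests share no usable structure, comparing them reveals only that the inputs differ, not how closely they are related. That is the gap similarity digests and learned embeddings aim to fill.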

Paper Structure (49 sections, 4 equations, 11 tables)