On the Role of Similarity in Detecting Masquerading Files

Jonathan Oliver; Jue Mo; Susmit Yenkar; Raghav Batta; Sekhar Josyoula

TL;DR

The paper addresses masquerading files: files crafted to resemble legitimate software, which can undermine similarity-based security solutions. It combines a taxonomy of masquerading types with a data-driven approach that applies TLSH-based clustering to the Malware Bazaar dataset, producing a representative set of $703$ candidates from $700{,}000$ samples at a distance threshold of $30$. It demonstrates several masquerading scenarios (no signature, signature not verified, residual X509 certificate, revoked certificate, certificate used to sign malware, and no trusted root authority) and shows how clustering can reveal anomalies that signatures alone may miss. The work argues for integrating digital signatures into ML and similarity pipelines and for using signature-based rules to detect compromised certificates. It also highlights the need for training regimes that include masquerading variants, while acknowledging the problem of unsigned files and the limits of detection in supply-chain contexts. Overall, the study provides practical guidance for combining similarity, clustering, and code-signing signals to improve the detection of masquerading executable files.

Abstract

Similarity has been applied to a wide range of security applications, typically as part of machine learning models. We examine the problem posed by masquerading samples, that is, samples crafted by bad actors to be similar or near identical to legitimate samples. We find that these samples potentially create significant problems for machine learning solutions, the primary problem being that bad actors can use them to circumvent such solutions. We then examine the interplay between digital signatures and machine learning solutions. In particular, we focus on executable files and code signing. We offer a taxonomy for masquerading files. We use a combination of similarity and clustering to find masquerading files. We use the insights gathered in this process to offer improvements to similarity-based and machine learning security solutions.
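The combination of similarity, clustering, and code signing described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it substitutes a simple single-linkage grouping (union-find over all pairs within a threshold) for whatever clustering the paper uses, and the names `cluster_by_threshold`, `masquerading_candidates`, and `toy_distance` are hypothetical. The `distance` argument stands in for a TLSH distance function, and treating the threshold of 30 as inclusive is an assumption.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

# (file name, similarity digest, whether the code signature verified)
Sample = Tuple[str, str, bool]


def cluster_by_threshold(
    digests: Sequence[str],
    distance: Callable[[str, str], int],
    threshold: int = 30,
) -> List[List[int]]:
    """Group indices whose digests are linked by distances <= threshold."""
    parent = list(range(len(digests)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pairwise comparison: fine for a sketch, not for 700,000 samples.
    for i in range(len(digests)):
        for j in range(i + 1, len(digests)):
            if distance(digests[i], digests[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(digests)):
        groups[find(i)].append(i)
    return list(groups.values())


def masquerading_candidates(
    samples: Sequence[Sample],
    distance: Callable[[str, str], int],
    threshold: int = 30,
) -> List[str]:
    """Return files without a valid signature that cluster with validly signed files."""
    clusters = cluster_by_threshold([s[1] for s in samples], distance, threshold)
    candidates: List[str] = []
    for members in clusters:
        signed = [samples[i][0] for i in members if samples[i][2]]
        unsigned = [samples[i][0] for i in members if not samples[i][2]]
        if signed and unsigned:  # mixed cluster: similar content, different trust
            candidates.extend(unsigned)
    return sorted(candidates)


def toy_distance(a: str, b: str) -> int:
    """Placeholder distance on numeric strings; a real run would use TLSH distance."""
    return abs(int(a) - int(b))


samples = [
    ("setup.exe", "100", True),
    ("setup_copy.exe", "110", False),
    ("other.exe", "500", False),
    ("tool.exe", "900", True),
]
print(masquerading_candidates(samples, toy_distance))
```

The design choice illustrated here is that neither signal is used alone: similarity groups the files, and signature status within each group marks the anomalies worth reviewing.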

Paper Structure (12 sections, 1 table)

This paper contains 12 sections, 1 table.

Introduction
A Taxonomy of Masquerading Files
Collecting Masquerading Files
The No Signature Case
The Not Verified Case
The Contains a X509 Certificate Case
The Certificate Revoked Case
The Certificate Used for Signing Malware Case
The No Trusted Root Authority Case
Finding Masquerading Files in Clusters
Clusters Related to Supply Chain Attacks
Conclusion and Future Work