On the Abuse and Detection of Polyglot Files

Luke Koch; Sean Oesch; Amul Chaulagain; Jared Dixon; Matthew Dixon; Mike Huettal; Amir Sadovnik; Cory Watson; Brian Weber; Jacob Hartman; Richard Patulski

TL;DR

Polyglot files, which are valid in two or more formats, threaten both malware detectors that route files by format and file-upload sanitization tools. The authors conduct the first survey of polyglot usage in the wild, build Fazah to generate realistic polyglots, and train PolyConv to detect and label polyglots, reporting a PR-AUC of $0.99998$ and F1 scores of $99.20\%$ (polyglot detection) and $99.47\%$ (multi-label file-format identification) and outperforming the other tools tested. They also present ImSan, a content disarmament and reconstruction tool that disarmed $100\%$ of the tested image-based polyglots, and they give recommendations for detection, disarmament, and improved file-format specifications. The work also identifies future research directions for hardening defenses against polyglot abuse across formats and contexts.

Abstract

A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as for file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding $30$ polyglot samples and $15$ attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild, the first of its kind, we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of $0.999$ with an F1 score of $99.20\%$ for polyglot detection and $99.47\%$ for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized $100\%$ of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to help defenders protect themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
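
To make the definition in the abstract concrete, the following is a minimal sketch of how one byte string can satisfy two naive format checks at once. It is not PolyConv or any tool evaluated in the paper. The helper names `claimed_formats` and `make_gif_zip_polyglot` are hypothetical, the magic-number table is a small illustrative subset, and the GIF part is only a header stub rather than a complete image. The sketch relies on the fact that ZIP readers locate the archive from the end of the file, while header checks for image formats look only at the first bytes.

```python
import io
import zipfile

# Illustrative subset of leading magic numbers (assumption: a real
# identifier would cover far more formats and validate more than a prefix).
HEAD_MAGIC = {
    "gif": (b"GIF87a", b"GIF89a"),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "pdf": (b"%PDF-",),
}


def claimed_formats(data: bytes) -> set[str]:
    """Return every format whose naive check accepts the byte string."""
    formats = {
        name for name, magics in HEAD_MAGIC.items()
        if data.startswith(magics)
    }
    # ZIP readers find the end-of-central-directory record by scanning
    # from the end of the file, so leading non-ZIP bytes are tolerated.
    if zipfile.is_zipfile(io.BytesIO(data)):
        formats.add("zip")
    return formats


def make_gif_zip_polyglot() -> bytes:
    """Build a toy byte string: a GIF header stub followed by a ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("payload.txt", "hello")
    # Header stub only: signature, 1x1 logical screen descriptor, trailer.
    gif_stub = b"GIF89a" + b"\x01\x00\x01\x00\x00\x00\x00" + b"\x3b"
    return gif_stub + buf.getvalue()


if __name__ == "__main__":
    poly = make_gif_zip_polyglot()
    print(sorted(claimed_formats(poly)))
    print(zipfile.ZipFile(io.BytesIO(poly)).read("payload.txt"))
```

A tool that stops at the first matching signature would label this input as a GIF and never inspect the archive, which is the routing problem the abstract describes.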

Paper Structure (38 sections, 12 figures, 4 tables)

Introduction
Related Work
Polyglot Detection
Polyglot Creation
Polyglot Exploitation
RQ1: Polyglot Exploitation in the Wild
Survey Methods
Role of Polyglot Files in Cyber Attack Chains
IcedID
Andariel/Lazarus
Wild Polyglots: A Polyglot Data Set Based on Malicious Usage in the Wild
Fazah: A Polyglot Generation Framework
Wild Polyglots Data Set Creation and Contents
RQ2: Using Machine Learning for Polyglot Detection
ML-based Detection Development
...and 23 more sections

Figures (12)

Figure 1: Functionality of a polyglot file is determined by the calling program, which can be explicitly provided or automatically determined by the operating system's auto-launch settings.
Figure 2: Since polyglot files simultaneously conform to multiple formats, they can evade correct format identification. This in turn allows them to evade format-specific feature extraction or signature matching, thereby evading malware detection. Therefore, some preprocessing should be done to either filter/quarantine polyglot files prior to feature extraction or route them to multiple format-specific malware detectors so all functional components of the polyglot are analyzed.
Figure 3: IcedID Attack Chain
Figure 4: Andariel/Lazarus Attack Chain
Figure 5: File counts for the monoglot formats in the Wild Polyglots training data.
...and 7 more figures
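
The preprocessing options described in the Figure 2 caption (quarantine polyglots, or route them to every relevant format-specific detector) can be sketched as a small dispatch function. This is an illustrative sketch under stated assumptions, not the paper's pipeline: the `route` function, the `Detector` type, and the toy detectors in the usage example are all hypothetical, and the list of formats is assumed to come from some upstream multi-label identifier.

```python
from typing import Callable, Dict, Iterable

# Hypothetical detector interface: returns True when the file looks malicious.
Detector = Callable[[bytes], bool]


def route(
    data: bytes,
    formats: Iterable[str],
    detectors: Dict[str, Detector],
    quarantine_polyglots: bool = False,
) -> str:
    """Return 'quarantine', 'malicious', or 'clean' for one file."""
    labels = sorted(set(formats))
    # Option 1 from the caption: filter or quarantine any polyglot outright.
    if len(labels) > 1 and quarantine_polyglots:
        return "quarantine"
    # Assumption: a file with no label, or a label without a registered
    # detector, cannot be fully analyzed and is quarantined as well.
    if not labels or any(label not in detectors for label in labels):
        return "quarantine"
    # Option 2 from the caption: run every format-specific detector so that
    # all functional components of the polyglot are analyzed.
    verdicts = [detectors[label](data) for label in labels]
    return "malicious" if any(verdicts) else "clean"


if __name__ == "__main__":
    # Toy detectors, for illustration only.
    toy_detectors: Dict[str, Detector] = {
        "gif": lambda data: False,
        "zip": lambda data: b"payload" in data,
    }
    sample = b"GIF89a...payload"
    print(route(sample, ["gif"], toy_detectors))
    print(route(sample, ["gif", "zip"], toy_detectors))
    print(route(sample, ["gif", "zip"], toy_detectors, quarantine_polyglots=True))
```

In the usage example, the first call models a single-label identifier: only the image detector runs, so the archive component is never examined.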
