Magika: AI-Powered Content-Type Detection

Yanick Fratantonio; Luca Invernizzi; Loua Farah; Kurt Thomas; Marina Zhang; Ange Albertini; Francois Galilee; Giancarlo Metitieri; Julien Cretin; Alex Petit-Bianco; David Tao; Elie Bursztein

TL;DR

This paper introduces Magika, a novel AI-powered content-type detection tool. Its deep learning model runs on a single CPU, needs just 1MB of memory to store its weights, and outperforms all existing content-type detection tools.

Abstract

The task of content-type detection -- which entails identifying the data encoded in an arbitrary byte sequence -- is critical for operating systems, development, reverse engineering environments, and a variety of security applications. In this paper, we introduce Magika, a novel AI-powered content-type detection tool. Under the hood, Magika employs a deep learning model that can execute on a single CPU with just 1MB of memory to store the model's weights. We show that Magika achieves an average F1 score of 99% across over a hundred content types and a test set of more than 1M files, outperforming all existing content-type detection tools today. In order to foster adoption and improvements, we open source Magika under an Apache 2 license on GitHub and make our model and training pipeline publicly available. Our tool has already seen adoption by the Gmail email provider for attachment scanning, and it has been integrated with VirusTotal to aid with malware analysis. We note that this paper discusses the first iteration of Magika, and a more recent version already supports more than 200 content types. The interested reader can see the latest development on the Magika GitHub repository, available at https://github.com/google/magika.
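To make the "average F1 score across content types" concrete, here is a minimal sketch of one way such a number can be computed. It assumes the average is an unweighted (macro) mean of per-type F1 scores; the abstract does not state the averaging scheme, and the labels and predictions below are made-up toy data, not results from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: one JavaScript sample is mislabeled as PDF.
y_true = ["pdf", "pdf", "javascript", "javascript", "zip", "zip"]
y_pred = ["pdf", "pdf", "javascript", "pdf", "zip", "zip"]

scores = per_class_f1(y_true, y_pred)
average = macro_f1(y_true, y_pred)
print(scores, round(average, 4))
```

With a macro average every content type counts equally, so a rare type that is detected poorly lowers the score as much as a common one would.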

Paper Structure (22 sections, 1 equation, 8 figures, 6 tables)

Introduction
Related work
Traditional approaches to content-type detection
Machine-learning approaches to content-type detection
Dataset
Benchmarking Existing Tools
Tools Selection
Metric selection
Automating Large-Scale Evaluations
Results
Magika
Requirements
Model Architecture
Training
Setting Confidence Thresholds
...and 7 more sections

Figures (8)

Figure 1: Example of the frailty of signature-based content-type detection when applied to distinct code snippets taken from "JavaScript Basics" in Mozilla's MDN js-example. The file tool, which relies on regular expressions for content-type detection, imprecisely labels each snippet as ASCII text unless the spaces around the "=" sign are removed. As we show, our proposed content-type detector Magika overcomes these robustness limitations.
Figure 2: List of content types in our dataset.
Figure 3: CDF of the sample sizes in our benchmark dataset.
Figure 4: Architecture of Magika. The input and the output are depicted in blue and green, respectively. The model's layers are in yellow. The layers in purple are used only in training. The numbers next to the layers' names indicate the size of their outputs.
Figure 5: Validation loss and validation accuracy as the training progresses in terms of the number of epochs. We find accuracy increases up to around 30 epochs.
...and 3 more figures
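The fragility described in the Figure 1 caption can be reproduced with a toy detector. The sketch below is an illustration only: the regular expression is a hypothetical signature invented for this example, not the rule that file actually uses, and the snippets are stand-ins rather than the exact MDN code from the figure.

```python
import re

# Hypothetical signature (an assumption for illustration, NOT file's real rule):
# a declaration keyword, whitespace, an identifier, then "=" with no space before it.
JS_SIGNATURE = re.compile(r"\b(?:var|let|const)\s+\w+=")


def toy_detect(snippet: str) -> str:
    """Label a snippet using a single brittle regex signature."""
    return "javascript" if JS_SIGNATURE.search(snippet) else "ascii text"


spaced = 'let myVariable = "Bob";'   # conventional formatting
compact = 'let myVariable="Bob";'    # same code, no spaces around "="

print(toy_detect(spaced))
print(toy_detect(compact))
```

Both snippets are the same JavaScript statement, yet only the compact form matches the signature. A detector built on hand-written patterns inherits this kind of sensitivity to formatting.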
