Magika: AI-Powered Content-Type Detection

Yanick Fratantonio, Luca Invernizzi, Loua Farah, Kurt Thomas, Marina Zhang, Ange Albertini, Francois Galilee, Giancarlo Metitieri, Julien Cretin, Alex Petit-Bianco, David Tao, Elie Bursztein

TL;DR

This paper introduces Magika, a novel AI-powered content-type detection tool. Its deep learning model runs on a single CPU, needs just 1MB of memory to store its weights, and outperforms all existing content-type detection tools.

Abstract

The task of content-type detection -- which entails identifying the data encoded in an arbitrary byte sequence -- is critical for operating systems, development, reverse engineering environments, and a variety of security applications. In this paper, we introduce Magika, a novel AI-powered content-type detection tool. Under the hood, Magika employs a deep learning model that can execute on a single CPU with just 1MB of memory to store the model's weights. We show that Magika achieves an average F1 score of 99% across over a hundred content types and a test set of more than 1M files, outperforming all existing content-type detection tools today. In order to foster adoption and improvements, we open source Magika under an Apache 2 license on GitHub and make our model and training pipeline publicly available. Our tool has already seen adoption by the Gmail email provider for attachment scanning, and it has been integrated with VirusTotal to aid with malware analysis. We note that this paper discusses the first iteration of Magika, and a more recent version already supports more than 200 content types. The interested reader can see the latest development on the Magika GitHub repository, available at https://github.com/google/magika.
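
To make the headline metric concrete, the following is a minimal sketch of how an average F1 score across content types can be computed. It assumes the average is an unweighted (macro) mean of per-type F1 scores, which the abstract does not state explicitly; the function names and the tiny label lists are hypothetical and are not taken from the Magika code base or its dataset.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels, for illustration only.
y_true = ["pdf", "pdf", "javascript", "javascript", "txt", "txt"]
y_pred = ["pdf", "pdf", "javascript", "txt", "txt", "txt"]

if __name__ == "__main__":
    for label, score in sorted(per_class_f1(y_true, y_pred).items()):
        print(f"{label}: {score:.3f}")
    print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

With a macro average, every content type contributes equally regardless of how many test files it has, so a detector cannot reach a high score by doing well only on the most common types.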

Paper Structure

This paper contains 22 sections, 1 equation, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Example of the frailty of signature-based content-type detection when applied to distinct code snippets taken from "JavaScript Basics" on Mozilla's MDN (js-example). The file utility, which relies on regular expressions for content-type detection, imprecisely labels each snippet as ASCII text unless the spaces around the "=" sign are removed. As we show, our proposed content-type detector Magika overcomes these robustness limitations.
  • Figure 2: List of content types in our dataset.
  • Figure 3: CDF of the sample sizes in our benchmark dataset.
  • Figure 4: Architecture of Magika. The input and the output are depicted in blue and green, respectively. The model's layers are in yellow. The layers in purple are used only in training. The numbers next to the layers' names indicate the size of their outputs.
  • Figure 5: Validation loss and validation accuracy as the training progresses in terms of the number of epochs. We find accuracy increases up to around 30 epochs.
  • ...and 3 more figures
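
The frailty described in the Figure 1 caption can be illustrated with a toy signature. The sketch below is not the rule used by the file utility; the regular expression, the function name, and the two snippets are hypothetical, chosen only to show how a signature that expects no spaces around "=" misses an equivalent snippet written with spaces.

```python
import re

# Hypothetical signature: treat content as JavaScript only if it contains a
# declaration with no whitespace around "=". This is a toy rule, not one
# taken from the file utility's magic database.
TOY_JS_SIGNATURE = re.compile(rb"\b(?:var|let|const)\s+\w+=\S")


def toy_detect(content: bytes) -> str:
    """Return a label based on the single toy signature above."""
    if TOY_JS_SIGNATURE.search(content):
        return "javascript"
    return "ASCII text"  # fallback when no signature matches


# Two semantically identical snippets that differ only in spacing.
compact = b'let myVariable="Bob";\n'
spaced = b'let myVariable = "Bob";\n'

if __name__ == "__main__":
    print(toy_detect(compact))
    print(toy_detect(spaced))
```

The two inputs are the same program, yet only the compact form matches the signature. A hand-written rule can be widened to cover this case, but each such fix addresses one formatting variation at a time.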