DNS-GT: A Graph-based Transformer Approach to Learn Embeddings of Domain Names from DNS Queries

Massimiliano Altieri; Ronan Hamon; Roberto Corizzo; Michelangelo Ceci; Ignacio Sanchez

Abstract

Network intrusion detection systems play a crucial role in the security strategy employed by organisations to detect and prevent cyberattacks. Such systems usually combine pattern detection signatures with anomaly detection techniques powered by machine learning methods. However, the commonly proposed machine learning methods present drawbacks such as over-reliance on labelled data and limited generalisation capabilities. To address these issues, embedding-based methods have been introduced to learn representations from network data that generalise effectively to many downstream tasks. DNS traffic is a frequent choice of such data, mainly because of its wide availability. However, current approaches do not properly consider contextual information among DNS queries. In this paper, we tackle this issue by proposing DNS-GT, a novel Transformer-based model that learns embeddings for domain names from sequences of DNS queries. The model is first pre-trained in a self-supervised fashion to learn the general behaviour of DNS activity. It can then be fine-tuned on specific downstream tasks, exploiting interactions with other relevant queries in a given sequence. Our experiments with real-world DNS data showcase the ability of our method to learn effective domain name representations. A quantitative evaluation on domain name classification and botnet detection tasks shows that our approach achieves better results than relevant baselines, creating opportunities for further exploration of large-scale language models for intrusion detection systems. Our code is available at: https://github.com/m-altieri/DNS-GT.

Paper Structure (26 sections, 15 equations, 14 figures, 5 tables)

Introduction
Related Work
Method
Sequencing
Knowledge-based Topologies
Modeling
Applications
Experiments
Dataset
Setup
Results and Discussion
Conclusion
Preliminaries
Pre-training Stage
Finetuning and Classification Stages
...and 11 more sections

Figures (14)

Figure 1: Conceptual overview of domain classification task with model training phases: pre-training (a) and fine-tuning (b).
Figure 2: Workflow of the proposed DNS-GT method. Each stage is represented in red. Input and output data are depicted in green.
Figure 3: Model architecture. The colour is used to emphasise model input and output (green), tensors (yellow), operations (red) and learnable neural networks (blue).
Figure 4: ROC curves for all considered methods with end-to-end training and evaluation.
Figure 5: High-level overview of the training process and the data pipeline for the proposed methodology to learn representations of DNS queries. The colour encodes the function of the stage: input and output (green), data (yellow), operations (red) and learnable neural networks (blue).
...and 9 more figures
