Improving the Identification of Real-world Malware's DNS Covert Channels Using Locality Sensitive Hashing

Pascal Ruffing, Denis Petrov, Sebastian Zillien, Steffen Wendzel

TL;DR

This work addresses the detection and attribution of DNS-based malware covert channels in real-world traffic. It proposes an LSH-based pipeline that encodes DNS subdomain sequences into similarity features via dual hashing, computes pairwise similarities within fixed-size windows, and feeds the resulting features to a Random Forest classifier for binary detection, malware-family identification, and behavioral classification. Each segment of $n$ requests requires $\binom{n}{2}=\frac{n(n-1)}{2}$ pairwise comparisons, giving $O(n^2)$ complexity per segment (with potential $O(n)$ rolling updates in streaming setups). The approach achieves higher F1-scores and substantially lower false positive rates than the Domainator baseline, generalizes to unseen malware variants and tools, and enables behavior-level attribution (e.g., upload/download/idle) based solely on DNS traffic. Limitations arise with low-variance idle traffic, which suggests future work on hybrid representations that incorporate temporal or semantic cues and on extending the framework to other protocols for broader forensic applicability.

Abstract

Nowadays, malware increasingly uses DNS-based covert channels in order to evade detection and maintain stealthy communication with its command-and-control servers. While prior work has focused on detecting such activity, identifying specific malware families and their behaviors from captured network traffic remains challenging due to the variability of DNS. In this paper, we present the first application of Locality Sensitive Hashing to the detection and identification of real-world malware utilizing DNS covert channels. Our approach encodes DNS subdomain sequences into statistical similarity features that effectively capture anomalies indicative of malicious activity. Combined with a Random Forest classifier, our method achieves higher accuracy and reduced false positive rates than prior approaches, while demonstrating improved robustness and generalization to previously unseen or modified malware samples. We further demonstrate that our approach enables reliable classification of malware behavior (e.g., uploading or downloading of files), based solely on DNS subdomains.
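
The idea of turning a window of DNS subdomains into statistical similarity features can be illustrated with a minimal sketch. This summary does not specify the paper's dual-hashing scheme, so the sketch swaps in a plain MinHash over character trigrams as a stand-in LSH. The function names, the signature length of 64, and the choice of summary statistics are all assumptions made for illustration, not the authors' implementation.

```python
import hashlib
from itertools import combinations
from statistics import mean, pstdev

NUM_HASHES = 64  # assumed signature length, not taken from the paper


def shingles(label, k=3):
    """Return the set of character k-grams of a subdomain label."""
    label = label.lower()
    if len(label) < k:
        return {label}
    return {label[i:i + k] for i in range(len(label) - k + 1)}


def minhash(label, num_hashes=NUM_HASHES):
    """MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    grams = shingles(label)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(), "big")
            for g in grams))
    return signature


def similarity(sig_a, sig_b):
    """Fraction of matching signature positions (estimates Jaccard similarity)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def windows(sequence, size):
    """Split a request sequence into non-overlapping fixed-size windows."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


def window_features(labels):
    """Summarize all n(n-1)/2 pairwise similarities of one window (needs n >= 2)."""
    sigs = [minhash(label) for label in labels]
    sims = [similarity(a, b) for a, b in combinations(sigs, 2)]
    return {
        "pairs": len(sims),
        "mean": mean(sims),
        "std": pstdev(sims),
        "min": min(sims),
        "max": max(sims),
    }


if __name__ == "__main__":
    benign = ["www", "www", "mail", "www", "cdn"]
    tunnel = ["a1b2c3d4e5f6", "zz9yy8xx7ww6", "q0p1o2n3m4l5", "k7j8h9g0f1d2", "r3s4t5u6v7w8"]
    print(window_features(benign))
    print(window_features(tunnel))
```

Under these assumptions, each window yields one feature vector that a classifier such as a Random Forest could consume. A window of 20 requests, the size used in Figure 5, produces $\binom{20}{2}=190$ pairwise comparisons.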

Paper Structure

This paper contains 26 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of the machine learning pipeline for DNS tunneling detection and malware identification.
  • Figure 2: Pairwise LSH similarity relationships for grouped windows of DNS requests from the Training Set.
  • Figure 3: Binary classification across varying window sizes for all evaluated datasets.
  • Figure 4: Family classification across varying window sizes.
  • Figure 5: Confusion matrices for the family classification task at window size 20, across Training, Variant Set, and GraphTunnel-Known datasets.
  • ...and 1 more figure