Improving the Identification of Real-world Malware's DNS Covert Channels Using Locality Sensitive Hashing
Pascal Ruffing, Denis Petrov, Sebastian Zillien, Steffen Wendzel
TL;DR
This work addresses the detection and attribution of DNS-based malware covert channels in real-world traffic. It proposes a novel LSH-based pipeline that encodes DNS subdomain sequences into similarity features via dual hashing, computes pairwise similarities within fixed windows, and feeds the resulting features to a Random Forest classifier for binary detection, malware-family identification, and behavioral classification. For a segment of $n$ elements, the distance computation requires $\binom{n}{2}=\frac{n(n-1)}{2}$ pairwise comparisons, giving $O(n^2)$ complexity per segment (with potential $O(n)$ rolling updates in streaming setups). The approach achieves higher F1-scores and substantially lower false positive rates than the Domainator baseline, generalizes to unseen malware variants and tools, and enables behavior-level attribution (e.g., upload/download/idle) based solely on DNS traffic. Limitations arise with low-variance idle traffic; suggested future work includes hybrid representations that incorporate temporal or semantic cues and an extension of the framework to other protocols for broader forensic applicability.
Abstract
Malware increasingly uses DNS-based covert channels to evade detection and maintain stealthy communication with its command-and-control servers. While prior work has focused on detecting such activity, identifying specific malware families and their behaviors from captured network traffic remains challenging due to the variability of DNS. In this paper, we present the first application of Locality Sensitive Hashing to the detection and identification of real-world malware utilizing DNS covert channels. Our approach encodes DNS subdomain sequences into statistical similarity features that effectively capture anomalies indicative of malicious activity. Combined with a Random Forest classifier, our method achieves higher accuracy and lower false positive rates than prior approaches, while demonstrating improved robustness and generalization to previously unseen or modified malware samples. We further demonstrate that our approach enables reliable classification of malware behavior (e.g., uploading or downloading of files) based solely on DNS subdomains.
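
To make the idea of encoding subdomain sequences into statistical similarity features concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes a SimHash over character trigrams as a stand-in LSH function, a 64-bit digest, and a small set of summary statistics (mean, standard deviation, minimum, maximum); the names `simhash`, `similarity`, and `window_features` are hypothetical. Every pair of subdomains in a window is compared, so a window of n subdomains produces n(n-1)/2 similarity values.

```python
import hashlib
from itertools import combinations
from statistics import mean, pstdev

BITS = 64  # digest width; an assumption for this sketch


def simhash(label: str, n: int = 3) -> int:
    """SimHash over character n-grams (a stand-in LSH, not the paper's hash)."""
    grams = [label[i:i + n] for i in range(max(1, len(label) - n + 1))]
    weights = [0] * BITS
    for gram in grams:
        raw = hashlib.blake2b(gram.encode(), digest_size=8).digest()
        h = int.from_bytes(raw, "big")
        for bit in range(BITS):
            # Each n-gram votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def similarity(a: int, b: int) -> float:
    """Fraction of matching bits between two digests (1.0 means identical)."""
    return 1.0 - bin(a ^ b).count("1") / BITS


def window_features(subdomains: list[str]) -> dict[str, float]:
    """Summarize all pairwise digest similarities within one fixed window."""
    if len(subdomains) < 2:
        raise ValueError("a window needs at least two subdomains")
    digests = [simhash(s) for s in subdomains]
    sims = [similarity(a, b) for a, b in combinations(digests, 2)]
    return {
        "pairs": float(len(sims)),  # n(n-1)/2 comparisons
        "mean": mean(sims),
        "std": pstdev(sims),
        "min": min(sims),
        "max": max(sims),
    }


if __name__ == "__main__":
    # Hypothetical window of subdomain labels, for illustration only.
    window = ["a1b2c3d4e5", "a1b2c3d4e6", "f9e8d7c6b5", "www", "mail"]
    print(window_features(window))
```

In a setup like the one described here, one such feature vector per window would serve as an input row for a classifier such as a Random Forest; the specific features and hash function above are illustrative assumptions only.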
