Command-line Obfuscation Detection using Small Language Models
Vojtech Outrata, Michael Adam Polak, Martin Kopp
TL;DR
The paper tackles the challenge of detecting adversarial command-line obfuscation across multiple LOLBins by training a small transformer model from scratch using a full pipeline: data preprocessing to normalize command lines, a custom WordPiece tokenizer, ELECTRA-style pre-training, and focal-loss fine-tuning to address severe class imbalance. Evaluated on large-scale Cisco telemetry, the approach yields high-precision detections and demonstrates the ability to identify unseen obfuscation techniques, surpassing signature-based methods. A detailed case study on Raspberry Robin and Gamarue illustrates practical gains, including detections of obfuscated patterns that would evade traditional detectors. The work highlights scalable, cross-LOLBin obfuscation detection with fast inference, and points to future work extending support to Unix-based environments.
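The focal-loss fine-tuning step mentioned above can be illustrated with a minimal sketch. It uses the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name `binary_focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration; the summary does not state the values or the exact formulation the authors used.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for a single example.

    p     -- predicted probability of the positive (obfuscated) class
    y     -- true label, 0 or 1
    gamma -- focusing parameter (assumed value, not taken from the paper)
    alpha -- weight of the positive class (assumed value, not taken from the paper)
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)             # keep log() finite
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of examples that are already well classified
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct benign example contributes far less than a badly misclassified one.
easy = binary_focal_loss(0.01, 0)
hard = binary_focal_loss(0.99, 0)
print(easy < hard)
```

With `gamma=0` the expression reduces to class-weighted cross-entropy, which is why the focusing term is what helps under severe class imbalance: the abundant, easily classified benign commands stop dominating the gradient.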
Abstract
To avoid detection, adversaries often use command-line obfuscation. Numerous obfuscation techniques exist, all designed to alter the command-line syntax without affecting its original functionality. This variability forces most security solutions to enumerate signatures exhaustively, even for a single pattern. Instead of relying on signatures, we implemented a scalable NLP-based detection method built on a custom-trained small transformer language model that can be applied to any source of execution logs. Evaluation on real-world telemetry demonstrates that our approach yields high-precision detections even on high-volume telemetry from a diverse set of environments, ranging from universities and businesses to healthcare and finance. The practical value is demonstrated in a case study of real-world samples detected by our model. We show the model's superiority over signatures on established malware known to employ obfuscation and showcase previously unseen obfuscated samples detected by our model.
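To make the signature problem concrete, the sketch below checks a naive substring signature against three variants of the same command. The variants use two commonly documented cmd.exe tricks: caret insertion (cmd.exe treats `^` as an escape character and drops it) and empty double-quote pairs. The signature, the sample commands, and the helper name `matches_signature` are hypothetical illustrations, not samples or rules from the paper.

```python
import re

# Hypothetical naive signature: flag any command line that contains "powershell".
SIGNATURE = re.compile(r"powershell", re.IGNORECASE)

# Three spellings of the same invocation (illustrative, not taken from the paper).
variants = [
    "powershell -NoProfile -Command whoami",            # plain
    "p^o^w^e^r^s^h^e^l^l -NoProfile -Command whoami",   # caret insertion
    'pow""ershell -NoProfile -Command whoami',          # empty double-quote pair
]


def matches_signature(command_line):
    """Return True if the naive signature fires on the raw command line."""
    return SIGNATURE.search(command_line) is not None


results = [matches_signature(c) for c in variants]
print(results)
```

Only the plain spelling matches. Covering the other two with signatures would require a separate pattern per insertion position and per technique, which is the enumeration burden described above.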
