Command-line Obfuscation Detection using Small Language Models

Vojtech Outrata, Michael Adam Polak, Martin Kopp

TL;DR

The paper tackles the challenge of detecting adversarial command-line obfuscation across multiple LOLBins by training a small transformer model from scratch using a full pipeline: data preprocessing to normalize command lines, a custom WordPiece tokenizer, ELECTRA-style pre-training, and focal-loss fine-tuning to address severe class imbalance. Evaluated on large-scale Cisco telemetry, the approach yields high-precision detections and demonstrates the ability to identify unseen obfuscation techniques, surpassing signature-based methods. A detailed case study on Raspberry Robin and Gamarue illustrates practical gains, including detections of obfuscated patterns that would evade traditional detectors. The work highlights scalable, cross-LOLBin obfuscation detection with fast inference, and points to future work extending support to Unix-based environments.
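
The preprocessing step mentioned above can be illustrated with a minimal sketch. It assumes normalization means replacing high-entropy patterns such as GUIDs with a single general placeholder, as the caption of Figure 1 describes. The regular expression, the `<GUID>` placeholder string, the function name `normalize_guids`, and the sample command line are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical pattern: a GUID in the common 8-4-4-4-12 hex layout,
# optionally wrapped in curly braces.
GUID_RE = re.compile(
    r"\{?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}?"
)


def normalize_guids(command_line: str, placeholder: str = "<GUID>") -> str:
    """Replace every GUID with one general placeholder (assumed token name)."""
    return GUID_RE.sub(placeholder, command_line)


# Illustrative input only; not a sample from the paper's telemetry.
raw = "msiexec.exe /x {12345678-90AB-cdef-1234-567890abcdef} /qn"
print(normalize_guids(raw))
```

Collapsing such patterns before tokenization keeps a tokenizer from spending many sub-word tokens on values that carry no signal for obfuscation detection.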

Abstract

To avoid detection, adversaries often use command-line obfuscation. Numerous obfuscation techniques exist, all designed to alter the command-line syntax without affecting its original functionality. This variability forces most security solutions to enumerate signatures exhaustively, even for a single pattern. Instead of relying on signatures, we implemented a scalable NLP-based detection method that leverages a custom-trained, small transformer language model and can be applied to any source of execution logs. Evaluation on real-world telemetry demonstrates that our approach yields high-precision detections even on high-volume telemetry from a diverse set of environments, ranging from universities and businesses to healthcare and finance. We demonstrate the practical value in a case study of real-world samples detected by our model. We show the model's superiority to signatures on established malware known to employ obfuscation and showcase previously unseen obfuscated samples that it detected.

Paper Structure

This paper contains 24 sections, 1 equation, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Tokenization of a raw command line compared to tokenization of the preprocessed command line by the same tokenizer. The GUID pattern in the raw command line is split into many non-meaningful tokens. In contrast, a single general token represents the pattern in the tokenized preprocessed command line.
  • Figure 2: The $\gamma$ parameter controls how strongly the loss of already well-classified examples is down-weighted. The cross-entropy loss is identical to the focal loss (FL) with $\gamma$ set to 0. Image source: focal_loss.
  • Figure 3: Comparison of the number of tokens produced on the evaluation dataset by an out-of-domain tokenizer and by custom tokenizers with various vocabulary sizes.
  • Figure 4: One execution log tokenized by an out-of-domain tokenizer (BERT TOKENIZER) and by two custom tokenizers for command-line data with vocabulary sizes of 1k and 20k, respectively.
  • Figure 5: Training loss curves for both pre-trained and non-pre-trained small models on the fine-tuning dataset. The non-pre-trained model shows problematic instability during training as compared to the pre-trained model.
  • ...and 6 more figures
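
The relationship between focal loss and cross-entropy stated in the caption of Figure 2 can be checked numerically. The sketch below assumes the standard binary focal loss, $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$, without the optional class-weighting term. The function name, the default $\gamma = 2$, and the example probabilities are illustrative assumptions, not settings reported in the paper.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Binary focal loss for a single example.

    p     -- predicted probability of the positive class
    y     -- true label (0 or 1)
    gamma -- focusing parameter; gamma = 0 gives plain cross-entropy
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    # Clamp to avoid log(0) for extreme predictions.
    p_t = min(max(p_t, 1e-12), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Well-classified positive (p_t = 0.9) versus a hard positive (p_t = 0.1).
for p in (0.9, 0.1):
    ce = focal_loss(p, 1, gamma=0.0)   # equals cross-entropy
    fl = focal_loss(p, 1, gamma=2.0)
    print(f"p_t={p}: cross-entropy={ce:.4f}, focal={fl:.4f}")
```

With $\gamma = 2$, the well-classified example keeps roughly 1% of its cross-entropy loss, while the hard example keeps about 81%, so training under severe class imbalance is driven mostly by the examples the model still gets wrong.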