Intrusion Detection at Scale with the Assistance of a Command-line Language Model
Jiongliang Lin, Yiwen Guo, Hao Chen
TL;DR
This work addresses intrusion detection at cloud scale by pre-training a transformer-based command-line language model on tens of millions of logs in a self-supervised fashion, yielding rich embeddings of user command lines. It combines unsupervised anomaly detection with several supervision-guided tuning strategies (reconstruction-based, classification-based, multi-line classification, and retrieval-based methods) to detect intrusions. On a large production dataset (30 million training samples, 10 million test samples), the approaches achieve high precision and recall, with classification-based tuning delivering the strongest overall performance and surpassing a commercial IDS on several metrics. The results demonstrate the practicality of deploying data-driven IDS at scale in real cloud environments, with potential for ensemble improvements and broader applicability.
Abstract
Intrusion detection is a long-standing and crucial problem in security. A system capable of detecting intrusions automatically is in great demand in enterprise security solutions. Existing solutions rely heavily on hand-crafted rules designed by security operators, which suffer from high false negative rates and generalize poorly to new, zero-day attacks at scale. AI and machine learning offer promising ways to address these issues by inspecting abnormal user behaviors intelligently and automatically from data. However, existing learning-based intrusion detection systems in the literature are mostly designed for small data, and they lack the ability to leverage the power of big data in cloud environments. In this paper, we target this problem and introduce an intrusion detection system that incorporates large-scale pre-training, training a large language model on tens of millions of command lines for AI-based intrusion detection. Experiments performed on 30 million training samples and 10 million test samples verify the effectiveness of our solution.
