SCORE: Syntactic Code Representations for Static Script Malware Detection

Ecenaz Erdemir, Kyuhong Park, Michael J. Morais, Vianne R. Gao, Marion Marschalek, Yi Fan

TL;DR

This paper proposes novel feature extraction and deep learning (DL)-based approaches for static script malware detection, targeting server-side threats, and demonstrates their effectiveness in learning code maliciousness for accurate detection of script malware.

Abstract

As businesses increasingly adopt cloud technologies, they also need to be aware of new security challenges, such as server-side script attacks, to ensure the integrity of their systems and data. Malicious scripts can steal data, compromise credentials, and disrupt operations. Unlike executables with standardized formats (e.g., ELF, PE), scripts are plaintext files with diverse syntax, making them harder to detect using traditional methods. As a result, more sophisticated approaches are needed to protect cloud infrastructures from these evolving threats. In this paper, we propose novel feature extraction and deep learning (DL)-based approaches for static script malware detection, targeting server-side threats. We extract features from plaintext code using two techniques: syntactic code highlighting (SCH) and abstract syntax tree (AST) construction. SCH leverages complex regexes to parse syntactic elements of code, such as keywords and variable names. An AST is a hierarchical representation of a program's syntactic structure. We then propose a sequential model and a graph-based model that exploit these feature representations to detect script malware. We evaluate our approach on more than 400K server-side scripts in Bash, Python, and Perl. We use a balanced dataset of 90K scripts for training, validation, and testing, with the remaining scripts reserved for further analysis. Experiments show that our method achieves a true positive rate (TPR) up to 81% higher than leading signature-based antivirus solutions, while maintaining a low false positive rate (FPR) of 0.17%. Moreover, our approach outperforms various neural network-based detectors, demonstrating its effectiveness in learning code maliciousness for accurate detection of script malware.
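
As a rough illustration of the two feature types described in the abstract, the sketch below tokenizes a script with a handful of hand-written regexes (a stand-in for syntactic code highlighting) and lists the node names of a Python AST. The rule set, scope names, and function names are assumptions made for this example; they are not the paper's grammar or implementation, and Python's `ast` module covers Python scripts only, not Bash or Perl.

```python
import ast
import re

# Hypothetical, heavily simplified highlighting rules: (scope name, regex).
# Real syntax-highlighting grammars are far larger; these are illustrative only.
RULES = [
    ("comment", r"#[^\n]*"),
    ("string", r"'[^'\n]*'|\"[^\"\n]*\""),
    ("keyword", r"\b(?:import|def|return|if|else|for|while)\b"),
    ("number", r"\b\d+\b"),
    ("identifier", r"[A-Za-z_]\w*"),
]

# One combined pattern; earlier rules win when several could match at a position.
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def highlight_tokens(source):
    """Return (scope, text) pairs in source order, skipping unmatched characters."""
    return [(match.lastgroup, match.group()) for match in TOKEN_RE.finditer(source)]


def ast_node_names(source):
    """Return AST node type names in ast.walk order (breadth-first)."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


if __name__ == "__main__":
    script = "import os  # setup\nx = os.getenv('HOME')\n"
    for scope, text in highlight_tokens(script):
        print(scope, repr(text))
    print(ast_node_names(script))
```

Either output is a sequence that a sequential model could consume after embedding; how the paper serializes and embeds these features is not specified in the abstract, so that step is left out here.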

Paper Structure

This paper contains 28 sections, 7 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: System overview: Our system first parses scripts for their SCORE-H or SCORE-T features. In the case of SCORE-H, our SM is fed the serialized input and generates a malicious or benign verdict. In the case of SCORE-T, ASTs are either traversed and fed to the SM or fed directly to a GRL model along with their adjacency matrix to output a malicious or benign verdict.
  • Figure 2: Exemplars of (a) Python script from Legion, a credential harvester, (b) its abstract syntax tree, and (c) syntactic code highlighting expressions.
  • Figure 3: Syntax highlighting (left) and byte (right) feature embedding module for SCORE-H model. The SCORE-T model uses the same byte-string embedding submodule, but only a subset of the convolutional filters shown in the scope (node-name, resp.) embedding submodule.
  • Figure 4: SM architecture. Joint byte-syntactic features are extracted from syntax highlighting intermediates for SCORE-H (red, left) or ASTs for SCORE-T (orange, right). For either feature set, embeddings of syntactic features and bytes are concatenated into a sequence of inputs to a bi-LSTM RNN classifier, which yields malicious or benign verdicts.
  • Figure 5: Number of scripts with respect to (a) file size and (b) byte entropy in the training set.
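
Two quantities named in the captions above, the AST adjacency matrix (Figure 1) and byte entropy (Figure 5), can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the adjacency matrix is a plain parent-to-child 0/1 matrix for a Python AST, and byte entropy is taken to mean the Shannon entropy of the byte-value distribution in bits per byte. The captions do not specify either definition, so the paper's exact formulations may differ.

```python
import ast
import math
from collections import Counter


def ast_adjacency(source):
    """Return (node_names, adjacency) for the AST of a Python script.

    adjacency[i][j] == 1 means node j is a direct child of node i.
    Each visit gets its own index, so node objects that the parser may
    share across the tree (e.g. Load/Store contexts) are counted per occurrence.
    """
    names, edges = [], []
    stack = [(ast.parse(source), None)]  # (node, parent index)
    while stack:
        node, parent = stack.pop()
        index = len(names)
        names.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, index))
        for child in ast.iter_child_nodes(node):
            stack.append((child, index))
    size = len(names)
    adjacency = [[0] * size for _ in range(size)]
    for parent, child in edges:
        adjacency[parent][child] = 1
    return names, adjacency


def byte_entropy(data):
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    names, adjacency = ast_adjacency("x = 1")
    print(names)
    print(adjacency)
    print(byte_entropy(b"import os\n"))
```

Because the AST is a tree, the matrix holds exactly one edge per non-root node. A graph model would typically receive this matrix together with per-node features such as embedded node names.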