Beyond Detection: A Comprehensive Benchmark and Study on Representation Learning for Fine-Grained Webshell Family Classification

Feijiang Han

TL;DR

This work addresses the need for fine-grained WebShell family classification by introducing the first large-scale benchmark that links raw dynamic function call traces to diverse representations. It decouples representation learning from classification and evaluates sequences, graphs, and trees, augmented with LLM-generated traces across four real-world datasets. The study finds that structural representations, especially tree-based abstractions learned via GNNs (notably GAT), consistently outperform sequential approaches and are more robust to obfuscation, offering practical guidance for threat intelligence and rapid incident response. Overall, the paper provides a robust baseline and actionable implementation strategies, and it argues for a shift toward proactive defense beyond simple detection in critical infrastructure security.

Abstract

Malicious WebShells pose a significant and evolving threat by compromising critical digital infrastructures and endangering public services in sectors such as healthcare and finance. While the research community has made significant progress in WebShell detection (i.e., distinguishing malicious samples from benign ones), we argue that it is time to transition from passive detection to in-depth analysis and proactive defense. One promising direction is the automation of WebShell family classification, which involves identifying the specific malware lineage in order to understand an adversary's tactics and enable a precise, rapid response. This crucial task, however, remains a largely unexplored area that currently relies on slow, manual expert analysis. To address this gap, we present the first systematic study to automate WebShell family classification. Our method begins with extracting dynamic function call traces to capture inherent behaviors that are resistant to common encryption and obfuscation. To enhance the scale and diversity of our dataset for a more stable evaluation, we augment these real-world traces with new variants synthesized by Large Language Models. These augmented traces are then abstracted into sequences, graphs, and trees, providing a foundation to benchmark a comprehensive suite of representation methods. Our evaluation spans classic sequence-based embeddings (CBOW, GloVe), transformers (BERT, SimCSE), and a range of structure-aware algorithms, including Graph Kernels, Graph Edit Distance, Graph2Vec, and various Graph Neural Networks. Through extensive experiments on four real-world, family-annotated datasets under both supervised and unsupervised settings, we establish a robust baseline and provide practical insights into the most effective combinations of data abstractions, representation models, and learning paradigms for this challenge.
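
To make the three data abstractions concrete, the following is a minimal sketch of how one function call trace could be turned into a sequence, a tree, and a graph. The trace format (chronological `(depth, function_name)` pairs), the sample trace, and the helper names `to_sequence`, `to_tree`, and `to_graph` are all assumptions made for illustration; the paper's actual trace format and extraction pipeline are not described in this section.

```python
from collections import Counter

# Hypothetical trace format (an assumption, not taken from the paper):
# chronological (depth, function_name) pairs, with depth 0 as the entry point.
trace = [
    (0, "main"),
    (1, "base64_decode"),
    (1, "eval"),
    (2, "system"),
    (1, "base64_decode"),
]

def to_sequence(trace):
    """Sequence view: function names in chronological order."""
    return [name for _, name in trace]

def to_tree(trace):
    """Tree view: nested nodes of the form {'name': str, 'children': [...]}."""
    root = {"name": "<root>", "children": []}
    stack = [root]  # stack[d + 1] is the most recent call seen at depth d
    for depth, name in trace:
        node = {"name": name, "children": []}
        del stack[depth + 1:]  # unwind to the caller of this call
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

def to_graph(trace):
    """Graph view: aggregated caller -> callee edges with call counts."""
    edges = Counter()

    def walk(node):
        for child in node["children"]:
            if node["name"] != "<root>":  # skip the artificial root
                edges[(node["name"], child["name"])] += 1
            walk(child)

    walk(to_tree(trace))
    return edges

sequence = to_sequence(trace)
tree = to_tree(trace)
graph = to_graph(trace)
```

In this toy trace, the graph view merges the two `base64_decode` calls into a single weighted edge from `main`, while the tree view keeps them as separate children in call order. That difference mirrors the distinction between an aggregate view of calling relationships and a view that preserves hierarchy and execution context.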

Paper Structure

This paper contains 51 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: The visualization of three data abstractions. (a) The Sequence Model visualizes the chronological execution flow. (b) The Graph Model provides a static, aggregate view of all calling relationships. (c) The Tree Model preserves the hierarchical call structure and execution context.
  • Figure 2: Performance comparison of representation methods on the DS4 dataset. Columns denote classifiers (KM: K-Means; MS: Mean-Shift; RF: Random Forest; SVM: Support Vector Machine) and metrics.
  • Figure C.1: Performance comparison of all representation methods on the DS1 dataset.
  • Figure C.2: Performance comparison of all representation methods on the DS2 dataset.
  • Figure C.3: Performance comparison of representation methods on the DS3 dataset.