Can LLMs Handle WebShell Detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework

Feijiang Han, Jiaming Zhang, Chuyi Deng, Jianheng Tang, Yunhuai Liu

TL;DR

The paper investigates whether large language models (LLMs) can effectively detect WebShells in PHP code, a task complicated by obfuscation and long code contexts. It benchmarks seven LLMs against traditional machine learning and graph-based methods on a large PHP dataset and introduces the Behavioral Function-Aware Detection (BFAD) framework to address domain-specific detection challenges. BFAD, consisting of a Critical Function Filter, Context-Aware Code Extraction, and Weighted Behavioral Function Profiling, significantly improves LLM performance, yielding an average F1 gain of 13.82% and enabling large models to surpass state-of-the-art baselines while making smaller models competitive. The work demonstrates that focusing on behaviorally relevant code regions and carefully selecting in-context demonstrations can unlock the potential of LLMs for WebShell detection, with practical implications for scalable, interpretable, and efficient cybersecurity tooling. It also outlines limitations and future directions, including dataset diversity, robustness against advanced obfuscation, and the integration of dynamic analysis for long-term resilience.

Abstract

WebShell attacks, where malicious scripts are injected into web servers, pose a significant cybersecurity threat. Traditional ML and DL methods are often hampered by challenges such as the need for extensive training data, catastrophic forgetting, and poor generalization. Recently, Large Language Models have emerged as powerful alternatives for code-related tasks, but their potential in WebShell detection remains underexplored. In this paper, we make two contributions: (1) a comprehensive evaluation of seven LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional sequence- and graph-based methods using a dataset of 26.59K PHP scripts, and (2) the Behavioral Function-Aware Detection (BFAD) framework, designed to address the specific challenges of applying LLMs to this domain. Our framework integrates three components: a Critical Function Filter that isolates malicious PHP function calls, a Context-Aware Code Extraction strategy that captures the most behaviorally indicative code segments, and Weighted Behavioral Function Profiling that enhances in-context learning by prioritizing the most relevant demonstrations based on discriminative function-level profiles. Our results show that, stemming from their distinct analytical strategies, larger LLMs achieve near-perfect precision but lower recall, while smaller models exhibit the opposite trade-off. However, all baseline models lag behind previous SOTA methods. With the application of BFAD, the performance of all LLMs improves significantly, yielding an average F1 score increase of 13.82%. Notably, larger models now outperform SOTA benchmarks, while smaller models such as Qwen-2.5-Coder-3B achieve performance competitive with traditional methods. This work is the first to explore the feasibility and limitations of LLMs for WebShell detection and provides solutions to address the challenges in this task.
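The precision/recall trade-off described above can be made concrete with a small calculation. The sketch below uses hypothetical confusion-matrix counts (not results from the paper) to show that a model with perfect precision but modest recall and a model with high recall but modest precision can land on nearly the same F1 score.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the trade-off.
# "Cautious" model: never raises a false alarm but misses 40 of 100 WebShells.
cautious = precision_recall_f1(tp=60, fp=0, fn=40)
# "Eager" model: catches 95 of 100 WebShells but flags 60 benign scripts.
eager = precision_recall_f1(tp=95, fp=60, fn=5)

for name, (p, r, f1) in (("cautious", cautious), ("eager", eager)):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a weakness on either side pulls the score down, which is consistent with the abstract's observation that both kinds of baseline model trail the prior state of the art.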

Paper Structure

This paper contains 31 sections, 5 equations, 1 figure, 8 tables, 1 algorithm.

Figures (1)

  • Figure 1: Overview of the Behavioral Function-Aware Detection framework for WebShell detection. It consists of three components: (a) Critical Function Filter, which identifies PHP functions associated with malicious behavior; (b) Context-Aware Code Extraction, which isolates critical code regions to overcome LLM context limitations; and (c) Weighted Behavioral Function Profiling, which selects ICL demonstrations using a behavior-weighted similarity score.
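The three components in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function list, the line-window extraction, the weighted cosine similarity, and all names (`CRITICAL_FUNCTIONS`, `critical_function_filter`, `context_aware_extract`, `weighted_similarity`, `select_demonstrations`) are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical list of behaviorally critical PHP functions (an assumption,
# not the list used in the paper).
CRITICAL_FUNCTIONS = ("eval", "system", "exec", "shell_exec", "passthru",
                      "assert", "base64_decode")
_CALL_RE = re.compile(r"\b(" + "|".join(CRITICAL_FUNCTIONS) + r")\s*\(",
                      re.IGNORECASE)


def critical_function_filter(php_source):
    """(a) Return (line_index, function_name) for each critical call."""
    hits = []
    for i, line in enumerate(php_source.splitlines()):
        for match in _CALL_RE.finditer(line):
            hits.append((i, match.group(1).lower()))
    return hits


def context_aware_extract(php_source, hits, window=2):
    """(b) Keep only the lines within `window` lines of a critical call."""
    lines = php_source.splitlines()
    keep = set()
    for i, _ in hits:
        keep.update(range(max(0, i - window), min(len(lines), i + window + 1)))
    return "\n".join(lines[i] for i in sorted(keep))


def build_profile(hits):
    """Count how often each critical function is called."""
    return Counter(name for _, name in hits)


def weighted_similarity(profile_a, profile_b, weights):
    """(c) Weighted cosine similarity; unlisted functions get weight 1.0."""
    names = set(profile_a) | set(profile_b)
    dot = sum(weights.get(n, 1.0) * profile_a.get(n, 0) * profile_b.get(n, 0)
              for n in names)
    norm_a = math.sqrt(sum(weights.get(n, 1.0) * profile_a.get(n, 0) ** 2
                           for n in names))
    norm_b = math.sqrt(sum(weights.get(n, 1.0) * profile_b.get(n, 0) ** 2
                           for n in names))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_demonstrations(query_profile, pool, weights, k=2):
    """Pick the k pool entries whose profiles best match the query."""
    ranked = sorted(
        pool,
        key=lambda d: weighted_similarity(query_profile, d["profile"], weights),
        reverse=True,
    )
    return ranked[:k]


sample = ("<?php\n$a = 1;\n$b = 2;\n$c = 3;\n$d = $_POST['x'];\n"
          "eval(base64_decode($d));\n$e = 5;\n$f = 6;\n?>")
hits = critical_function_filter(sample)
snippet = context_aware_extract(sample, hits, window=1)
print(snippet)
```

In this sketch, only the extracted snippet and the selected demonstrations would be placed in the LLM prompt, which is how the example keeps long scripts within a model's context limit. The weights are assumed to be non-negative, and how the paper actually derives them is not shown here.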