EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code

Shahriyar Zaman Ridoy, Md. Shazzad Hossain Shaon, Alfredo Cuzzocrea, Mst Shapna Akter

TL;DR

This paper proposes a novel ensemble stacking approach that synergizes multiple pre-trained large language models—CodeBERT, GraphCodeBERT, and UniXcoder—to improve vulnerability detection in source code, and demonstrates significant performance gains over existing methods.

Abstract

Automated detection of software vulnerabilities is critical for enhancing security, yet existing methods often struggle with the complexity and diversity of modern codebases. In this paper, we introduce EnStack, a novel ensemble stacking framework that enhances vulnerability detection using natural language processing (NLP) techniques. Our approach synergizes multiple pre-trained large language models (LLMs) specialized in code understanding: CodeBERT for semantic analysis, GraphCodeBERT for structural representation, and UniXcoder for cross-modal capabilities. By fine-tuning these models on the Draper VDISC dataset and integrating their outputs through meta-classifiers such as Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost, EnStack effectively captures intricate code patterns and vulnerabilities that individual models may overlook. The meta-classifiers consolidate the strengths of each LLM, resulting in a comprehensive model that excels in detecting subtle and complex vulnerabilities across diverse programming contexts. Experimental results demonstrate that EnStack significantly outperforms existing methods, achieving notable improvements in accuracy, precision, recall, and F1-score. This work highlights the potential of ensemble LLM approaches in code analysis tasks and offers valuable insights into applying NLP techniques for advancing automated vulnerability detection.
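The stacking idea in the abstract can be illustrated with a minimal sketch. It assumes each fine-tuned base model emits a per-class probability vector, and it simplifies the task to two classes (safe vs. vulnerable), whereas the paper targets multiclass prediction. The probability values, the function names `stack_features`, `fit_logistic`, and `predict`, and the hand-written logistic regression are all hypothetical stand-ins chosen so the example runs with the standard library only; they are not the authors' implementation.

```python
import math

def stack_features(*model_probs):
    """Concatenate each base model's probability vector, sample by sample."""
    n = len(model_probs[0])
    if any(len(p) != n for p in model_probs):
        raise ValueError("all base models must score the same samples")
    return [[v for p in model_probs for v in p[i]] for i in range(n)]

def fit_logistic(X, y, epochs=500, lr=0.5):
    """Fit a binary logistic-regression meta-classifier by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob minus label
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, X):
    """Return 1 (vulnerable) when the linear score is positive, else 0 (safe)."""
    w, b = model
    return [int(b + sum(wj * xj for wj, xj in zip(w, xi)) > 0) for xi in X]

# Made-up [P(safe), P(vulnerable)] outputs for four functions (illustrative only).
codebert      = [[0.9, 0.1], [0.8, 0.2],   [0.55, 0.45], [0.2, 0.8]]
graphcodebert = [[0.8, 0.2], [0.9, 0.1],   [0.1, 0.9],   [0.1, 0.9]]
unixcoder     = [[0.9, 0.1], [0.85, 0.15], [0.05, 0.95], [0.3, 0.7]]
labels = [0, 0, 1, 1]

meta_X = stack_features(codebert, graphcodebert, unixcoder)  # 6 meta-features per sample
model = fit_logistic(meta_X, labels)
preds = predict(model, meta_X)  # scored on the training rows, for illustration only
print(preds)
```

In this toy data the third function would be missed by the first model alone (its vulnerable probability is below 0.5), while the meta-classifier recovers it from the other two models' outputs. A realistic setup would train the meta-classifier on held-out base-model predictions and evaluate on a separate test split.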

Paper Structure

This paper contains 15 sections, 2 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: A Comparative Overview of Vulnerability Detection Techniques. (1) Traditional LLM-based processing, which directly outputs predictions but experiences notable data loss, (2) Traditional+Meta models from previous studies that integrate a meta-classifier to enhance LLM outputs, and (3) the proposed EnStack framework, which leverages an ensemble of multiple LLMs combined through stacking methods. EnStack incorporates a meta-model to further refine predictions, aiming for improved accuracy in vulnerability detection by effectively combining strengths of various LLMs and meta-model architectures.
  • Figure 2: Methodology Workflow for Enhanced Vulnerability Detection using the EnStack Framework. The proposed methodology begins with the CWE dataset, which includes multiclass labels associated with various types of vulnerable code. During the processing phase, data cleansing steps, such as null value removal and downsampling, ensure balanced and representative input data. The cleaned data is then fed into an ensemble of large language models (LLMs), where features are extracted and combined to capture diverse vulnerability patterns. These combined features are passed through a stacking framework that employs various machine learning algorithms to enhance model robustness. The stacked model produces a final multiclass prediction, facilitating accurate and refined detection of code vulnerabilities across different classes.
  • Figure 3: t-SNE visualization of latent space representations for CWE categories in vulnerability detection. (a) Baseline model fine-tuned using CodeBERT, exhibiting class overlap and reduced inter-cluster separation; (b) enhanced representations through ensemble stacking (EnStack) with CodeBERT and GraphCodeBERT (C+G) followed by logistic regression (LR), demonstrating improved cluster formation and class separability.
  • Figure 4: Ablation study results showcasing the impact of model combinations and meta-classifier choice in stacking ensembles. (a) Performance comparison of model combinations (C+G vs. G+U) across meta-classifiers (LR, RF, SVM, and XGBoost). G+U consistently outperforms C+G, with SVM achieving the highest accuracy of 82.36%. (b) Detailed analysis of meta-classifier performance on the G+U combination, highlighting the dominance of linear classifiers (SVM and LR) in capturing complementary features. XGBoost underperforms, indicating added complexity may not yield better results for this task.
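The data cleansing steps named in the Figure 2 caption (null value removal and downsampling) can be sketched as follows. This is a minimal illustration under stated assumptions: records are `(code, label)` pairs, every class is downsampled to the size of the rarest class, and the function name `clean_and_downsample`, the code snippets, and the sample labels are hypothetical. The caption does not specify the paper's exact sampling ratio, so the equal-size target here is an assumption.

```python
import random
from collections import defaultdict

def clean_and_downsample(samples, seed=0):
    """Drop records with a missing code or label, then downsample every
    class to the size of the rarest remaining class (assumed strategy)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_label = defaultdict(list)
    for code, label in samples:
        if code is None or label is None or not code.strip():
            continue  # null value removal
        by_label[label].append(code)
    if not by_label:
        return []
    target = min(len(codes) for codes in by_label.values())
    balanced = []
    for label, codes in by_label.items():
        for code in rng.sample(codes, target):  # sample without replacement
            balanced.append((code, label))
    rng.shuffle(balanced)
    return balanced

# Illustrative records only; the snippets and labels are made up for the example.
raw = [
    ("char b[8]; strcpy(b, s);", "CWE-120"),
    ("char b[4]; gets(b);", "CWE-120"),
    ("memcpy(d, s, n);", "CWE-120"),
    ("p->x = 1;", "CWE-476"),
    (None, "CWE-476"),        # missing code, dropped
    ("   ", "CWE-120"),       # blank code, dropped
    ("return *q;", "CWE-476"),
    ("free(p);", None),       # missing label, dropped
]

balanced = clean_and_downsample(raw)
counts = {}
for _, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

After cleaning, three CWE-120 records and two CWE-476 records remain, so each class is reduced to two records. Downsampling discards data from the larger classes; it trades dataset size for balance, which is one reasonable reading of the caption's "balanced and representative input data".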