Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models

Aidan Z. H. Yang, Haoye Tian, He Ye, Ruben Martins, Claire Le Goues

TL;DR

This work tackles the limitations of code-token–only LLMs in vulnerability detection by introducing MSIVD, a multitask self-instruct fine-tuning framework that leverages vulnerability explanations and data-flow information. By pairing a CodeLlama-13B-Instruct LLM with a DFA-based GNN adapter and training on multi-round dialogues that include vulnerability labels, descriptions, and fixes, MSIVD achieves state-of-the-art F1 scores on established BigVul data ($F1=0.92$) and demonstrates improved generalization to unseen vulnerabilities on PreciseBugs ($F1\approx0.48$). The approach leverages parameter-efficient fine-tuning (QLoRA/LoRA) and a weighted multi-task loss to balance detection and explanation objectives, revealing that explanations can substantially boost performance while mitigating overfitting. The work also highlights data leakage concerns in LLM evaluation and introduces a novel post-2023 vulnerability dataset to provide a more robust assessment, underscoring the practical impact of combining LLMs with graph-based program analysis for secure software engineering.

Abstract

Software security vulnerabilities allow attackers to perform malicious activities to disrupt software operations. Recent Transformer-based language models have significantly advanced vulnerability detection, surpassing the capabilities of static-analysis-based deep learning models. However, language models trained solely on code tokens capture neither the explanation of the vulnerability type nor the data-flow structure of the code, both of which are crucial for vulnerability detection. We propose a novel technique that integrates a multitask sequence-to-sequence LLM with program control flow graphs encoded as a graph neural network to achieve sequence-to-classification vulnerability detection. We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction. Our experiments demonstrate that MSIVD outperforms the strongest LLM-based vulnerability detector baseline (LineVul), with an F1 score of 0.92 on the BigVul dataset and 0.48 on the PreciseBugs dataset. By training LLMs and GNNs simultaneously using a combination of code and explanatory metrics of a vulnerable program, MSIVD represents a promising direction for advancing LLM-based vulnerability detection that generalizes to unseen data. Based on our findings, we further discuss the necessity for new labelled security vulnerability datasets, as recent LLMs have seen or memorized prior datasets' held-out evaluation data.
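
The weighted multi-task objective mentioned in the TL;DR can be pictured with a minimal sketch. The exact loss used by MSIVD is not given on this page, so everything below is an assumption for illustration: the function names, the choice of binary cross-entropy for detection, mean token negative log-likelihood for explanation, and the weights `w_detect` and `w_explain` are all hypothetical.

```python
import math

def binary_cross_entropy(p_vuln, label):
    """Detection loss for one sample; p_vuln is the predicted probability of 'vulnerable'."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_vuln, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def token_nll(token_probs):
    """Explanation loss: mean negative log-likelihood of the reference explanation tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def multitask_loss(p_vuln, label, token_probs, w_detect=0.5, w_explain=0.5):
    """Weighted sum of the two objectives (weights are hypothetical, not from the paper)."""
    return (w_detect * binary_cross_entropy(p_vuln, label)
            + w_explain * token_nll(token_probs))

# One vulnerable sample: confident detection, moderately likely explanation tokens.
loss = multitask_loss(p_vuln=0.9, label=1, token_probs=[0.8, 0.6, 0.9])
print(f"combined loss: {loss:.4f}")
```

Raising `w_explain` pushes the model toward reproducing the vulnerability explanation, while raising `w_detect` favors the classification label; the balance between the two is the kind of trade-off the weighted loss is meant to control.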

Paper Structure

This paper contains 29 sections, 2 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Example CWE-770 (allocation of resources without limits or throttling) vulnerability. MSIVD's multi-task fine-tuning uses as features all of the code, vulnerability description, exploitability score, severity, attack complexity, and vulnerable lines.
  • Figure 2: MSIVD's architecture, which takes as training data a code snippet, its vulnerability label, and various human-annotated vulnerability labels. MSIVD outputs a final vulnerability classification on unseen code snippets.
  • Figure 3: A single training data entry for MSIVD's vulnerability detection multi-task fine-tuning. The four rounds of dialogue between human and bot follow four kinds of labelled data: vulnerability classification label, vulnerability description, vulnerability type, and vulnerability repair lines.
  • Figure 4: LoRA re-parameterization for efficient fine-tuning, where only $A$ and $B$ contain trainable parameters, and the initial pre-trained weights $W_0$ remain frozen.
  • Figure 5: Loss curve of MSIVD on BigVul and PreciseBugs. A lower loss value indicates model predictions that are closer to the ground-truth labels, and a near-zero loss indicates over-fitting. Note that we also run the same experiment on the Devign dataset, and observe the same loss curve as on BigVul without explanation.
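
The LoRA re-parameterization described in the Figure 4 caption can be illustrated with a minimal sketch: the frozen weight $W_0$ is left untouched and a low-rank product $BA$ is added to it, so only $A$ and $B$ would receive gradient updates. The matrix sizes, the `scale` factor, and the helper names below are assumptions for illustration. The initialization (Gaussian $A$, zero $B$) follows the original LoRA formulation and is not confirmed by this page.

```python
import random

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_forward(x, w0, a, b, scale=1.0):
    """Compute h = W0 x + scale * B (A x). W0 stays frozen; only A and B would be trained."""
    base = matmul(w0, x)               # frozen pre-trained path
    update = matmul(b, matmul(a, x))   # low-rank trainable path
    return [[p + scale * q for p, q in zip(rp, rq)] for rp, rq in zip(base, update)]

d, k, r = 8, 8, 2                      # hypothetical output dim, input dim, and rank
rng = random.Random(0)
w0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # d x k, frozen
a = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # r x k, trainable
b = [[0.0] * r for _ in range(d)]                               # d x r, trainable, zero init
x = [[float(i + 1)] for i in range(k)]                          # k x 1 input column

h = lora_forward(x, w0, a, b)
trainable = r * k + d * r              # parameters in A and B
frozen = d * k                         # parameters in W0
print(f"trainable={trainable}, frozen={frozen}")
```

Because $B$ starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and the saving in trainable parameters grows as the rank `r` shrinks relative to `d` and `k`.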