Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code

Md. Azizul Hakim Bappy, Hossen A Mustafa, Prottoy Saha, Rajinus Salehat

TL;DR

This work tackles the privacy and cost barriers of cloud-based LLMs for CWE detection by fine-tuning a 350M-parameter Small Language Model (codegen-mono) on Python code. It combines a semi-supervised, LLM-assisted data generation pipeline with rigorous human validation and an instruction-following fine-tuning strategy, achieving near-perfect CWE detection performance (≈99% accuracy, 98.08% precision, 100% recall, 99.04% F1) on a held-out set. Baseline experiments show that the un-tuned base model fails to identify CWEs in the same samples, underscoring the value of task-specific fine-tuning. The results support a practical, on-premise, privacy-preserving approach for integrating CWE detection directly into software development workflows, with implications for secure, low-resource deployments and, in future work, broader language and CWE coverage.

Abstract

Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as those catalogued as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and their substantial computational requirements make them difficult to apply to sensitive or proprietary codebases because of privacy concerns and inference costs. This work explores the potential of Small Language Models (SLMs) as a viable alternative for accurate, on-premise vulnerability detection. We investigated whether a 350-million-parameter pre-trained code model (codegen-mono) could be effectively fine-tuned to detect the MITRE Top 25 CWEs specifically within Python code. To facilitate this, we developed a targeted dataset of 500 examples using a semi-supervised approach that couples LLM-driven synthetic data generation with meticulous human review. Initial tests confirmed that the base codegen-mono model completely failed to identify CWEs in our samples. However, after instruction-following fine-tuning, the specialized SLM achieved remarkable performance on our test set, yielding approximately 99% accuracy, 98.08% precision, 100% recall, and a 99.04% F1-score. These results strongly suggest that fine-tuned SLMs can serve as highly accurate and efficient tools for CWE detection, offering a practical and privacy-preserving solution for integrating advanced security analysis directly into development workflows.
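
For readers less familiar with the four reported metrics, the following is a minimal sketch of how accuracy, precision, recall, and F1 are conventionally computed for a binary "vulnerable / not vulnerable" decision. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the paper's data or evaluation code; the toy labels only mimic the reported pattern of no missed vulnerabilities plus an occasional false alarm.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Compute standard binary classification metrics.

    Label 1 means "contains a CWE", label 0 means "no CWE".
    (Hypothetical helper for illustration, not the authors' code.)
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # flagged and vulnerable
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged but safe
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not flagged and safe
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed vulnerability

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return BinaryMetrics(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Toy labels only: every vulnerable sample is caught (recall 1.0),
    # but one safe sample is flagged (precision below 1.0).
    truth = [1, 1, 1, 0, 0, 0, 1, 0]
    preds = [1, 1, 1, 1, 0, 0, 1, 0]
    print(binary_metrics(truth, preds))
```

In a security setting, 100% recall with slightly lower precision means no vulnerable sample in the test set was missed, at the cost of a small number of false alarms.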

Paper Structure

This paper contains 17 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Semi-Supervised Dataset Creation and Fine-Tuning Pipeline
  • Figure 2: Dataset Example