Distilling Lightweight Language Models for C/C++ Vulnerabilities
Zhiyuan Wei, Xiaoxuan Yang, Jing Sun, Zijian Zhang
TL;DR
This work addresses robust vulnerability detection in C/C++ by using knowledge distillation to train lightweight models that match or surpass large LLMs in real-world settings. FineSec combines a multi-agent data-distillation pipeline, parameter-efficient fine-tuning (QLoRA/LoRA), and instruction-aligned outputs to produce accurate, structured vulnerability reports with remediation guidance. The results indicate that domain-specific distillation can yield substantial detection gains, uncover previously undocumented vulnerabilities, and deliver deployment-ready models for resource-constrained environments. The framework is modular and extensible, supporting continuous learning and extension to additional languages and domains, with practical implications for scalable software security tooling.
Abstract
The increasing complexity of modern software systems exacerbates the prevalence of security vulnerabilities, posing risks of severe breaches and substantial economic loss. Consequently, robust code vulnerability detection is essential for software security. While Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing, their potential for automated code vulnerability detection remains underexplored. This paper presents FineSec, a novel framework that uses knowledge distillation to transfer expertise from large teacher models to compact student models, enabling efficient and precise vulnerability identification in C/C++ codebases with high accuracy at minimal computational cost. FineSec integrates data preparation, training, evaluation, and continuous learning into a unified, single-task workflow. Extensive evaluations on C/C++ codebases show that FineSec outperforms both base models and larger LLMs in identifying complex vulnerabilities and logical flaws, establishing it as a practical and scalable solution for real-world software security. To facilitate reproducibility, the datasets, source code, and experimental results are made publicly available at: https://github.com/yangxiaoxuan123/FineSec_detect.
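As a rough illustration of the data-distillation idea summarized above, the following minimal sketch shows how a teacher's structured verdicts on code samples could be turned into instruction-style records for fine-tuning a student. It is not FineSec's actual pipeline: `stub_teacher`, `VulnReport`, `build_student_record`, the prompt wording, and the record fields are all hypothetical names chosen for this example, and the teacher is replaced by a trivial rule so the code runs without any model.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class VulnReport:
    """Hypothetical structured report format (an assumption, not the paper's schema)."""
    cwe_id: str
    description: str
    remediation: str


def stub_teacher(code: str) -> VulnReport:
    """Stand-in for a large teacher LLM; a real pipeline would query a model here."""
    if "strcpy(" in code:
        return VulnReport(
            cwe_id="CWE-120",
            description="Unbounded copy into a fixed-size buffer.",
            remediation="Use a length-checked copy such as snprintf.",
        )
    return VulnReport("NONE", "No vulnerability identified.", "No change required.")


def build_student_record(code: str, teacher=stub_teacher) -> dict:
    """Pair a code sample with the teacher's report as one fine-tuning example."""
    report = teacher(code)
    return {
        "instruction": "Analyze the following C/C++ code for vulnerabilities.",
        "input": code,
        "output": json.dumps(asdict(report)),
    }


samples = [
    "void f(char *s) { char buf[8]; strcpy(buf, s); }",
    'void g(char *s) { char buf[8]; snprintf(buf, sizeof buf, "%s", s); }',
]

# The distilled dataset: teacher-labelled examples a student could be tuned on.
dataset = [build_student_record(s) for s in samples]
```

Records of this shape could then be passed to a parameter-efficient fine-tuning step such as LoRA or QLoRA; that step is omitted here because its configuration is not described in this section.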
