K-ASTRO: Structure-Aware Adaptation of LLMs for Code Vulnerability Detection

Yifan Zhang, Michael Sandborn, Stefan Larson, Yu Huang, Kevin Leach

TL;DR

K-ASTRO tackles code vulnerability detection by marrying AST-based structural cues with semantic LLM embeddings in a single lightweight Transformer. It introduces Diversity-Introducing AST Augmentation, Structure-Aware Attention Bias, and Joint LLM Adaptation to fuse syntax and semantics efficiently. Across BigVul, DiverseVul, and PrimeVul, K-ASTRO achieves state-of-the-art performance with roughly 1M parameters (~4MB) and rapid CPU inference, outperforming larger off-the-shelf LLMs and several baselines. The work provides open-source tools and suggests practical impact for secure, scalable vulnerability detection in resource-constrained environments.

Abstract

Large Language Models (LLMs) are transforming software engineering tasks, including code vulnerability detection, a critical area of software security. However, existing methods often rely on resource-intensive models or graph-based techniques, limiting their accessibility and practicality. This paper introduces K-ASTRO, a lightweight Transformer model that combines semantic embeddings from LLMs with structural features of Abstract Syntax Trees (ASTs) to improve both efficiency and accuracy in code vulnerability detection. Our approach introduces an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates augmented AST features, and a joint adaptation pipeline to unify code semantics and syntax. Experimental results on three large-scale datasets (BigVul, DiverseVul, and PrimeVul) demonstrate state-of-the-art performance while enabling rapid inference on CPUs with minimal training time. By offering a scalable, interpretable, and efficient solution, K-ASTRO bridges the gap between LLM advancements and practical software vulnerability detection, providing open-sourced tools to foster further research.

Paper Structure (26 sections, 7 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Example AST. AST representation of lines 6, 7, and 9 of Listing \ref{lst:vuln_code_sample}, parsed with Clang 14.0 and visualized with Graphviz.
  • Figure 2: Overview of K-ASTRO. The framework processes source code into semantic embeddings via LLMs and structural embeddings via AST augmentation. Augmented ASTs provide structural insights through a structure-aware attention mechanism, which is combined with LLM embeddings for final vulnerability prediction using a single lightweight Transformer block.
  • Figure 3: AST Augmentation Pipeline. The process involves AST generation, subtree extraction, and augmentation through node replacement to create structurally diverse representations for vulnerability detection.
  • Figure 4: Structure-Aware Attention Mechanism. The mechanism incorporates co-occurrence matrices from ASTs into the Transformer’s attention map, enhancing its structural understanding.
  • Figure 5: Visualization of Augmented ASTs. Adjacency matrices generated during the AST augmentation process, where subtrees from functions with corresponding vulnerability labels replace selected nodes in the original ASTs. The original matrix (left column) and the augmented matrices (right columns, shown here for $K=8$) depict node connectivity in the resulting ASTs. We consider a total of $n=50$ node kinds, as observed in the datasets using Clang. These structures are incorporated into K-ASTRO via sparse attention to enhance vulnerability prediction performance (Section \ref{sec:structure_aware}). In our experiments we set $K=4$ rather than the $K=8$ shown in the figure, to balance feature diversity and experimentation speed.
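
The captions for Figures 3 to 5 describe three steps: build a node-kind adjacency matrix from an AST, augment the AST by replacing a node with a subtree from another function, and feed the resulting matrix into the attention map. The following is a minimal sketch of that idea and not the authors' implementation. Everything concrete in it is an assumption made for illustration: the toy `(kind, children)` AST encoding, the five-entry `NODE_KINDS` vocabulary (the paper reports $n=50$ Clang node kinds), the symmetric 0/1 matrix, the additive `weight` parameter, and an attention map sized to match the node-kind matrix.

```python
import math

# Hypothetical node-kind vocabulary (the paper reports n = 50 Clang node kinds).
NODE_KINDS = ["FunctionDecl", "CompoundStmt", "DeclStmt", "CallExpr", "ReturnStmt"]
KIND_INDEX = {kind: i for i, kind in enumerate(NODE_KINDS)}


def node(kind, *children):
    """Build a toy AST node as a (kind, children) pair."""
    return (kind, list(children))


def adjacency(ast):
    """Return an n x n 0/1 matrix marking which node kinds occur as parent and child."""
    n = len(NODE_KINDS)
    matrix = [[0.0] * n for _ in range(n)]
    stack = [ast]
    while stack:
        kind, children = stack.pop()
        for child in children:
            i, j = KIND_INDEX[kind], KIND_INDEX[child[0]]
            matrix[i][j] = matrix[j][i] = 1.0  # assumption: edges treated as undirected
            stack.append(child)
    return matrix


def replace_first(ast, target_kind, donor):
    """Return (new_ast, replaced): the first pre-order node of target_kind is swapped for donor."""
    kind, children = ast
    if kind == target_kind:
        return donor, True
    new_children, replaced = [], False
    for child in children:
        if not replaced:
            child, replaced = replace_first(child, target_kind, donor)
        new_children.append(child)
    return (kind, new_children), replaced


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(scores, bias, weight=1.0):
    """Add a structural bias to raw attention scores, then normalize each row."""
    return [
        softmax([s + weight * b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


# Toy function body: a declaration followed by a return.
original = node("FunctionDecl", node("CompoundStmt", node("DeclStmt"), node("ReturnStmt")))
# Stand-in for a subtree drawn from another function with the same vulnerability label.
donor = node("CallExpr")
augmented, replaced = replace_first(original, "DeclStmt", donor)

# With uniform raw scores, the bias alone decides where attention concentrates.
n = len(NODE_KINDS)
uniform_scores = [[0.0] * n for _ in range(n)]
attention = biased_attention(uniform_scores, adjacency(augmented))
```

In this toy run the replacement removes the `CompoundStmt`/`DeclStmt` edge and adds a `CompoundStmt`/`CallExpr` edge, so the `CompoundStmt` row of `attention` puts more weight on `CallExpr` than on `DeclStmt`. Repeating the replacement with $K$ different donor subtrees would yield $K$ such matrices per function, which is the kind of structural diversity Figure 5 visualizes.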