
When Labels Are Scarce: A Systematic Mapping of Label-Efficient Code Vulnerability Detection

Noor Khalal, Chakib Fettal, Lazhar Labiod, Mohamed Nadif

Abstract

Machine-learning-based code vulnerability detection (CVD) has progressed rapidly, from deep program representations to pretrained code models and LLM-centered pipelines. Yet dependable vulnerability labeling remains expensive, noisy, and uneven across projects, languages, and CWE types, motivating approaches that reduce reliance on human labeling. This survey maps these approaches, synthesizing five paradigm families and the mechanisms they use. It connects mechanisms to token, graph, hybrid, and knowledge-based representations, and consolidates evaluation and reporting axes that limit comparison (label-budget specification, compute/cost assumptions, leakage, and granularity mismatches). A Design Map and constraint-first Decision Guide distill trade-offs and failure modes for practical method selection.

Paper Structure

This paper contains 88 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Study selection flow.
  • Figure 2: Annual publication counts of included studies by paradigm family for the Main pool (a) and the Inspiration pool (b). "Other" refers to Cross-cutting mechanisms and specialized settings.
  • Figure 3: Code representation trends in the Main pool.
  • Figure 4: Primitive-level intersections across representations.
  • Figure 5: Distribution of task types and their association with learning paradigms.
  • ...and 2 more figures