Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

Fouad Trad, Ali Chehab

TL;DR

The paper examines retrieval-augmented few-shot prompting as a cost-effective alternative to fine-tuning for multi-label code vulnerability detection. Using Gemini-1.5-Flash, it systematically compares three few-shot strategies (random example selection, retrieval-augmented prompting, and retrieval-based labeling) against zero-shot prompting and against fine-tuned baselines, namely fine-tuned Gemini-1.5-Flash and smaller open-source models. Semantic retrieval of in-context examples consistently improves prompting performance without any training. Fine-tuning the open-source CodeBERT model achieves higher absolute scores, but retrieval-augmented prompting offers a favorable cost-performance balance, especially in low-resource settings. The findings underscore how much example quality and retrieval matter when structuring in-context guidance for complex code-analysis tasks.

Abstract

Few-shot prompting has emerged as a practical alternative to fine-tuning for leveraging the capabilities of large language models (LLMs) in specialized tasks. However, its effectiveness depends heavily on the selection and quality of in-context examples, particularly in complex domains. In this work, we examine retrieval-augmented prompting as a strategy to improve few-shot performance in code vulnerability detection, where the goal is to identify one or more security-relevant weaknesses present in a given code snippet from a predefined set of vulnerability categories. We perform a systematic evaluation using the Gemini-1.5-Flash model across three approaches: (1) standard few-shot prompting with randomly selected examples, (2) retrieval-augmented prompting using semantically similar examples, and (3) retrieval-based labeling, which assigns labels based on retrieved examples without model inference. Our results show that retrieval-augmented prompting consistently outperforms the other prompting strategies. At 20 shots, it achieves an F1 score of 74.05% and a partial match accuracy of 83.90%. We further compare this approach against zero-shot prompting and several fine-tuned models, including Gemini-1.5-Flash and smaller open-source models such as DistilBERT, DistilGPT2, and CodeBERT. Retrieval-augmented prompting outperforms both zero-shot (F1 score: 36.35%, partial match accuracy: 20.30%) and fine-tuned Gemini (F1 score: 59.31%, partial match accuracy: 53.10%), while avoiding the training time and cost associated with model fine-tuning. On the other hand, fine-tuning CodeBERT yields higher performance (F1 score: 91.22%, partial match accuracy: 91.30%) but requires additional training, maintenance effort, and resources.
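
The three strategies compared in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the example pool, the label names, the toy bag-of-tokens embedding (a stand-in for a real semantic encoder), the prompt wording, and the choice to aggregate retrieved labels by union. The call to the language model is left out; only example selection and prompt assembly are shown.

```python
import math
import random
import re
from collections import Counter

# Hypothetical labeled pool: (code snippet, set of vulnerability labels).
POOL = [
    ('query = "SELECT * FROM users WHERE id=" + user_id', {"sql_injection"}),
    ('os.system("ping " + host)', {"command_injection"}),
    ('cursor.execute("DELETE FROM t WHERE name=" + name)', {"sql_injection"}),
    ('return "<div>" + comment + "</div>"', {"xss"}),
]


def embed(code):
    """Toy embedding: bag of lowercase identifier tokens (assumption)."""
    return Counter(re.findall(r"[A-Za-z_]+", code.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def random_shots(pool, k, seed=0):
    """Strategy 1: pick k in-context examples at random."""
    return random.Random(seed).sample(pool, min(k, len(pool)))


def retrieved_shots(pool, query, k):
    """Strategy 2: pick the k examples most similar to the query snippet."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def retrieval_labels(pool, query, k):
    """Strategy 3: label from retrieved neighbours only, with no model call.

    Taking the union of neighbour labels is an assumption; a vote
    threshold would be another reasonable aggregation rule.
    """
    labels = set()
    for _, example_labels in retrieved_shots(pool, query, k):
        labels |= example_labels
    return labels


def build_prompt(shots, query):
    """Assemble a few-shot prompt; the wording is purely illustrative."""
    parts = ["List the vulnerability categories present in each snippet."]
    for code, labels in shots:
        parts.append(f"Code: {code}\nLabels: {', '.join(sorted(labels))}")
    parts.append(f"Code: {query}\nLabels:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    snippet = 'cursor.execute("SELECT * FROM accounts WHERE id=" + account_id)'
    print(build_prompt(retrieved_shots(POOL, snippet, 2), snippet))
    print(retrieval_labels(POOL, snippet, 2))
```

Strategies 1 and 2 differ only in how the shots passed to `build_prompt` are chosen, which is why retrieval quality can change results without any training. Strategy 3 skips the prompt entirely and serves as a check on how much the model adds beyond nearest-neighbour lookup.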

Paper Structure

This paper contains 17 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Summary of the three investigated few-shot strategies
  • Figure 2: The various prompts used for vulnerability detection
  • Figure 3: Performance evolution across shot counts (1-10) for the various metrics