SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis

Yansong Li, Paula Branco, Alexander M. Hoole, Manish Marwah, Hari Manassery Koduvely, Guy-Vincent Jourdan, Stephan Jou

TL;DR

SV-TrustEval-C introduces a structure- and semantic-focused benchmark for evaluating LLMs on C code vulnerability analysis. It pairs a Flow Extractor and Behaviour Simulator to generate structure-oriented variants (data/control flow) and semantic perturbations (safe/unsafe/impaired), and then asks LLMs to answer data-flow, control-flow, and semantic questions across Counterfactual, Goal-driven, and Predictive tasks. Across 11 models and multiple scales, the study finds that most LLMs rely on pattern matching rather than robust logical reasoning, with notably weaker performance in semantic reasoning and inconsistent behavior under code perturbations. The benchmark demonstrates strong coverage (377 base files, 82 CWEs, 9,401 questions) and generalizes to datasets like PrimeVul, underscoring the need for domain-adapted training and prompt design to improve trustworthiness in code vulnerability analysis. Overall, SV-TrustEval-C provides a scalable, extensible framework to quantify structural and semantic reasoning in LLM-based vulnerability analysis and highlights critical directions for future improvements.

Abstract

As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce SV-TrustEval-C, a benchmark designed to evaluate LLMs' abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning - assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning - examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TrustEval-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is publicly available.

Paper Structure

This paper contains 50 sections, 12 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview of the SV-TrustEval-C Benchmark: This figure highlights question-answering templates for Structure Reasoning, divided into control-flow and data-flow segments, alongside Semantic Reasoning, encompassing Counterfactual, Goal-driven, and Predictive tasks. Each section illustrates various question types and potential answer scenarios, designed to assess LLMs’ abilities to handle the diverse complexities of code semantics and structure.
  • Figure 2: Overview of the Outer, Inner, and Outer&Inner structures. The impaired block prints or returns None and does not mimic the behavior of either $C_{\text{safe}}$ or $C_{\text{unsafe}}$. Here, $C_{\text{safe}}\cap C_{\text{unsafe}}$ represents the common parts of both versions, while $C_{\text{safe}}\setminus C_{\text{unsafe}}$ denotes the unique parts of $C_{\text{safe}}$.
  • Figure 3: Overview of benchmark question distribution across two core capabilities, five question types, and multiple difficulty levels. The notation "-w CI" indicates code variants with control flow injection. The term None indicates that there is no connection within the dataflow or control flow graph in the structure reasoning scenario. The question generation strategy is designed to ensure a uniform distribution across each base file, prioritizing high difficulty levels.
  • Figure 4: Prompt template for the in-context learning pipeline.
  • Figure 5: Overall consistency scores of LLMs on SV-TrustEval-C. For CodeLlama, the average consistency score across the 7B and 13B versions is reported due to their high similarity in performance. Abbreviations: DFL (DataFlow-wise questions), CFL (ControlFlow-wise questions), CTF (Counterfactual questions), GDV (Goal-driven questions), PRD (Predictive questions).
  • ...and 8 more figures
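
The set notation in the Figure 2 caption can be illustrated with a small sketch. It assumes, purely for illustration, that each code variant is treated as a set of whitespace-normalized source lines; the paper may operate at a different granularity (statements, blocks, or graph nodes). The two C fragments and the names `C_SAFE`, `C_UNSAFE`, and `normalize` are hypothetical and are not taken from the benchmark.

```python
# Hypothetical toy variants of one function body; NOT from the benchmark.
C_SAFE = [
    "char buf[16];",
    "if (len < sizeof(buf)) {",
    "    memcpy(buf, src, len);",
    "}",
]
C_UNSAFE = [
    "char buf[16];",
    "memcpy(buf, src, len);",
]


def normalize(lines):
    """Strip indentation so that lines compare by content only (assumption)."""
    return {line.strip() for line in lines}


safe, unsafe = normalize(C_SAFE), normalize(C_UNSAFE)

common = safe & unsafe      # C_safe ∩ C_unsafe: parts shared by both versions
safe_only = safe - unsafe   # C_safe \ C_unsafe: parts unique to the safe version

print("common:   ", sorted(common))
print("safe only:", sorted(safe_only))
```

In this toy case the bounds check is the only part unique to the safe variant, while the buffer declaration and the copy are shared by both versions.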