Scalable Defect Detection via Traversal on Code Graph

Zhengyao Liu, Xitong Zhong, Xingjing Deng, Shuo Hong, Xiang Gao, Hailong Sun

TL;DR

This work tackles the scalability challenge of graph-based vulnerability detection in large codebases by introducing QVoG, a platform built around a compressed Code Property Graph (CPG). It combines a declarative, SQL-like DSL with a language-agnostic query engine and CodeBERT-based models to generalize vulnerability detection, particularly taint analysis. Empirical results show QVoG can process projects with >1,000,000 lines of code in roughly 15 minutes and achieves high precision/recall on Juliet CWE benchmarks, outperforming Joern and CodeQL in several scenarios. The approach yields a scalable, open-source framework for multi-language static analysis that blends graph traversal with machine learning to reduce over-specific rules and improve detection coverage.

Abstract

Detecting defects and vulnerabilities at an early stage has long been a challenge in software engineering. Static analysis, a technique that inspects code without executing it, has emerged as a key strategy to address this challenge. Among recent advancements, graph-based representations, particularly the Code Property Graph (CPG), have gained traction due to their comprehensive depiction of code structure and semantics. Despite this progress, existing graph-based analysis tools still face performance and scalability issues. The main bottleneck lies in the size and complexity of the CPG, which makes analyzing large codebases slow and memory-intensive. In addition, the query rules used by current tools can be over-specific. Hence, we introduce QVoG, a graph-based static analysis platform for detecting defects and vulnerabilities. It employs a compressed CPG representation to keep the graph at a reasonable size, thereby improving overall query efficiency. On top of the CPG, it offers a declarative query language to simplify queries. Furthermore, it goes a step further by integrating machine learning to enhance the generality of vulnerability detection. For projects consisting of 1,000,000+ lines of code, QVoG can complete analysis in approximately 15 minutes, as opposed to 19 minutes with CodeQL.
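To make the idea of defect detection via traversal on a code graph concrete, the following is a minimal sketch of taint analysis as a reachability query over data-flow edges. It is not QVoG's DSL or API: the toy graph, the node numbering, and the function `find_taint_paths` are all hypothetical names introduced only for illustration, and a real CPG carries far richer node and edge types.

```python
from collections import deque

# Hypothetical toy code graph (not QVoG's data model):
# nodes are statements, edges are data-flow links between them.
NODES = {
    1: "data = read_input()",    # assumed taint source
    2: "size = parse(data)",
    3: "buf = alloc(16)",
    4: "copy(buf, data, size)",  # assumed taint sink
    5: "log('done')",
}
DATA_FLOW = {1: [2, 4], 2: [4], 3: [4], 4: [], 5: []}


def find_taint_paths(edges, sources, sinks, barriers=frozenset()):
    """Return one shortest data-flow path from each source to each reachable sink.

    Nodes listed in `barriers` (for example sanitizers) stop propagation.
    """
    paths = []
    for source in sources:
        parent = {source: None}  # also serves as the visited set
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if node in sinks and node != source:
                # Rebuild the path by walking parent links back to the source.
                path = []
                cur = node
                while cur is not None:
                    path.append(cur)
                    cur = parent[cur]
                paths.append(path[::-1])
                continue  # do not propagate past a sink
            for nxt in edges.get(node, []):
                if nxt not in parent and nxt not in barriers:
                    parent[nxt] = node
                    queue.append(nxt)
    return paths


if __name__ == "__main__":
    for path in find_taint_paths(DATA_FLOW, sources=[1], sinks={4}):
        print(" -> ".join(NODES[n] for n in path))
```

A declarative query in the spirit of the abstract would let the user state only the source, sink, and barrier predicates, leaving a traversal like this one to the query engine.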

Paper Structure (45 sections, 2 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Architecture of QVoG
  • Figure 2: Workflow of QVoG
  • Figure 3: Query engine architecture
  • Figure 4: Database adapter workflow
  • Figure 5: Workflow of Query Engine Combining Models
  • ...and 2 more figures