Table of Contents
Fetching ...

Customizing Static Analysis using Codesearch

Avi Hayoun, Veselin Raychev, Jack Hair

TL;DR

The paper addresses the challenge of building customizable static analyses via an accessible yet efficient language. It introduces StarLang, a fast, monadic Datalog subset with stratified negation and unary predicates, designed to enable real-time rule authoring for static analysis. A practical frontend, Codesearch, provides a lightweight interface and standard library to express complex analyses (taint, dataflow, typestate, points-to) with templates and autocompletion. Through case studies across tainting, dataflow, typestate, and linting, the approach demonstrates expressive power and real-time feedback, offering a practical path for integrating semantic analyses into large code repositories. Overall, StarLang and Codesearch deliver a user-friendly DSL that supports efficient, scalable static-analysis queries with meaningful real-world applicability.

Abstract

Static analysis is a growing application of software engineering, leading to a range of essential security tools, bug-finding tools, as well as software verification. Recent years show an increase of universal static analysis tools that validate a range of properties and allow customizing parts of the scanner to validate additional properties or "static analysis rules". A commonly used language to describe a range of static analysis applications is Datalog. Unfortunately, the language is still non-trivial to use, leading to analysis that is difficult to implement in a precise but performant way. In this work, we aim to make building custom static analysis tools much easier for developers, while at the same time, providing a familiar framework for application security and static analysis experts. Our approach introduces a language called StarLang, a variant of Datalog which only includes programs with a fast runtime by the means of having low time complexity of its decision procedure.

Customizing Static Analysis using Codesearch

TL;DR

The paper addresses the challenge of building customizable static analyses via an accessible yet efficient language. It introduces StarLang, a fast, monadic Datalog subset with stratified negation and unary predicates, designed to enable real-time rule authoring for static analysis. A practical frontend, Codesearch, provides a lightweight interface and standard library to express complex analyses (taint, dataflow, typestate, points-to) with templates and autocompletion. Through case studies across tainting, dataflow, typestate, and linting, the approach demonstrates expressive power and real-time feedback, offering a practical path for integrating semantic analyses into large code repositories. Overall, StarLang and Codesearch deliver a user-friendly DSL that supports efficient, scalable static-analysis queries with meaningful real-world applicability.

Abstract

Static analysis is a growing application of software engineering, leading to a range of essential security tools, bug-finding tools, as well as software verification. Recent years show an increase of universal static analysis tools that validate a range of properties and allow customizing parts of the scanner to validate additional properties or "static analysis rules". A commonly used language to describe a range of static analysis applications is Datalog. Unfortunately, the language is still non-trivial to use, leading to analysis that is difficult to implement in a precise but performant way. In this work, we aim to make building custom static analysis tools much easier for developers, while at the same time, providing a familiar framework for application security and static analysis experts. Our approach introduces a language called StarLang, a variant of Datalog which only includes programs with a fast runtime by the means of having low time complexity of its decision procedure.
Paper Structure (35 sections, 12 equations, 2 figures, 1 algorithm)

This paper contains 35 sections, 12 equations, 2 figures, 1 algorithm.

Figures (2)

  • Figure 1: The Snyk Code Codesearch system. The components in the gray box comprise the system itself, with the inputs and outputs placed above and below respectively. Rectangles represent computations, while ellipses represent data. The "Codesearch STL" (STandard Library) node is on the border of the system, acting as an API to the underlying StarLang language.
  • Figure 2: Auto-completion for the argument of the template in the context of the code snippet from Example \ref{['example:codesearch-query']}. The most highly-ranked suggestions are all names of functions that are called in the code snippet.

Theorems & Definitions (26)

  • Example 1
  • Example 2
  • Remark 1
  • Example 3
  • Example 4
  • Definition 3.1: Valid starlang rules i
  • Example 5
  • Remark 2
  • Remark 3
  • Definition 3.2: Valid StarLang rules ii
  • ...and 16 more