Table of Contents
Fetching ...

Program Analysis via Multiple Context Free Language Reachability

Giovanna Kobus Conrado, Adam Husted Kjelstrøm, Andreas Pavlogiannis, Jaco van de Pol

TL;DR

This paper introduces Multiple Context-Free Language (MCFL) reachability as an expressive yet tractable model for static program analysis and demonstrates the utility of MCFL reachability by developing a family of MCFLs that approximate interleaved Dyck reachability, a common but undecidable static analysis problem.

Abstract

Context-free language (CFL) reachability is a standard approach in static analyses, where the analysis question is phrased as a language reachability problem on a graph $G$ wrt a CFL L. While CFLs lack the expressiveness needed for high precision, common formalisms for context-sensitive languages are such that the corresponding reachability problem is undecidable. Are there useful context-sensitive language-reachability models for static analysis? In this paper, we introduce Multiple Context-Free Language (MCFL) reachability as an expressive yet tractable model for static program analysis. MCFLs form an infinite hierarchy of mildly context sensitive languages parameterized by a dimension $d$ and a rank $r$. We show the utility of MCFL reachability by developing a family of MCFLs that approximate interleaved Dyck reachability, a common but undecidable static analysis problem. We show that MCFL reachability be computed in $O(n^{2d+1})$ time on a graph of $n$ nodes when $r=1$, and $O(n^{d(r+1)})$ time when $r>1$. Moreover, we show that when $r=1$, the membership problem has a lower bound of $n^{2d}$ based on the Strong Exponential Time Hypothesis, while reachability for $d=1$ has a lower bound of $n^{3}$ based on the combinatorial Boolean Matrix Multiplication Hypothesis. Thus, for $r=1$, our algorithm is optimal within a factor $n$ for all levels of the hierarchy based on $d$. We implement our MCFL reachability algorithm and evaluate it by underapproximating interleaved Dyck reachability for a standard taint analysis for Android. Used alongside existing overapproximate methods, MCFL reachability discovers all tainted information on 8 out of 11 benchmarks, and confirms $94.3\%$ of the reachable pairs reported by the overapproximation on the remaining 3. To our knowledge, this is the first report of high and provable coverage for this challenging benchmark set.

Program Analysis via Multiple Context Free Language Reachability

TL;DR

This paper introduces Multiple Context-Free Language (MCFL) reachability as an expressive yet tractable model for static program analysis and demonstrates the utility of MCFL reachability by developing a family of MCFLs that approximate interleaved Dyck reachability, a common but undecidable static analysis problem.

Abstract

Context-free language (CFL) reachability is a standard approach in static analyses, where the analysis question is phrased as a language reachability problem on a graph wrt a CFL L. While CFLs lack the expressiveness needed for high precision, common formalisms for context-sensitive languages are such that the corresponding reachability problem is undecidable. Are there useful context-sensitive language-reachability models for static analysis? In this paper, we introduce Multiple Context-Free Language (MCFL) reachability as an expressive yet tractable model for static program analysis. MCFLs form an infinite hierarchy of mildly context sensitive languages parameterized by a dimension and a rank . We show the utility of MCFL reachability by developing a family of MCFLs that approximate interleaved Dyck reachability, a common but undecidable static analysis problem. We show that MCFL reachability be computed in time on a graph of nodes when , and time when . Moreover, we show that when , the membership problem has a lower bound of based on the Strong Exponential Time Hypothesis, while reachability for has a lower bound of based on the combinatorial Boolean Matrix Multiplication Hypothesis. Thus, for , our algorithm is optimal within a factor for all levels of the hierarchy based on . We implement our MCFL reachability algorithm and evaluate it by underapproximating interleaved Dyck reachability for a standard taint analysis for Android. Used alongside existing overapproximate methods, MCFL reachability discovers all tainted information on 8 out of 11 benchmarks, and confirms of the reachable pairs reported by the overapproximation on the remaining 3. To our knowledge, this is the first report of high and provable coverage for this challenging benchmark set.

Paper Structure

This paper contains 26 sections, 26 theorems, 54 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

All-pairs $d\text{-}\operatorname{MCFL}(r)$-reachability given a grammar $\mathcal{G}$ on a graph $G$ of $n$ nodes can be solved in

Figures (9)

  • Figure 1: A program $P$ containing three functions tie(), foo(), bar() and composite objects of type pair.
  • Figure 2: Two graphs modeling context and field sensitivity through edge labels.
  • Figure 3: A derivation tree corresponding for the MCFG $\mathcal{G}$ for the Language $\mathcal{L}=\{w_1w_2\#w_2w_1\mid w_1,w_2\in\{0,1\}^*\}$, executed on the word $0100\#0001$. Edge labels indicate the corresponding derivation rule.
  • Figure 4: A derivation tree witnessing $\textcolor{mybluecolor}{(_1}\textcolor{black!20!red}{[_1}\textcolor{mybluecolor}{)_1}\textcolor{black!20!red}{[_2}\textcolor{black!20!red}{]_2}\textcolor{mybluecolor}{(_2}\textcolor{black!20!red}{]_1}\textcolor{mybluecolor}{)_2}\in \mathcal{L}(\mathcal{G}_4)$.
  • Figure 5: Interleavings of elements in $P^4$ and $Q^4$ to form a string in $\mathcal{S}$.
  • ...and 4 more figures

Theorems & Definitions (37)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Theorem 3
  • Lemma 6
  • ...and 27 more