Exploiting Code Symmetries for Learning Program Semantics

Kexin Pei, Weichen Li, Qirui Jin, Shuyang Liu, Scott Geng, Lorenzo Cavallaro, Junfeng Yang, Suman Jana

TL;DR

This paper develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph, and suggests that code LLMs that encode the code structural prior via the code symmetry group generalize better and faster.

Abstract

This paper tackles the challenge of teaching code semantics to Large Language Models (LLMs) for program analysis by incorporating code symmetries into the model architecture. We introduce a group-theoretic framework that defines code symmetries as semantics-preserving transformations; forming a code symmetry group enables precise and efficient reasoning about code semantics. Our solution, SymC, develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph. SymC obtains superior performance on five program analysis tasks, outperforming state-of-the-art code models without any pre-training. Our results suggest that code LLMs that encode the code structural prior via the code symmetry group generalize better and faster.

Paper Structure (28 sections, 9 theorems, 11 figures, 8 tables)

This paper contains 28 sections, 9 theorems, 11 figures, 8 tables.

Key Result

Theorem 1

The set of automorphisms $\sigma\in Aut(\mathcal{IG})$ forms a program symmetry group.
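As a rough illustration of what the theorem asserts, the sketch below enumerates the automorphisms of a small directed graph by brute force and checks the group axioms (identity, closure, inverses) on the resulting set. The four-node graph is a hypothetical stand-in rather than an actual $\mathcal{IG}$ from the paper, and the helper names (`automorphisms`, `compose`, `inverse`, `is_group`) are illustrative assumptions.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return every node permutation that maps the edge set onto itself."""
    edge_set = set(edges)
    found = []
    for perm in permutations(range(n)):
        mapped = {(perm[u], perm[v]) for u, v in edge_set}
        if mapped == edge_set:
            found.append(perm)
    return found


def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(p)))


def inverse(p):
    """Return the permutation that undoes p."""
    inv = [0] * len(p)
    for i, j in enumerate(p):
        inv[j] = i
    return tuple(inv)


def is_group(perms, n):
    """Check identity, closure under composition, and inverses."""
    s = set(perms)
    identity = tuple(range(n))
    closed = all(compose(p, q) in s for p in s for q in s)
    has_inverses = all(inverse(p) in s for p in s)
    return identity in s and closed and has_inverses


# Hypothetical toy graph: nodes 0 and 1 are independent of each other,
# both feed node 2, and node 2 feeds node 3.
edges = [(0, 2), (1, 2), (2, 3)]
auts = automorphisms(4, edges)
forms_group = is_group(auts, 4)
```

In this toy graph only the identity and the swap of nodes 0 and 1 preserve every edge, which matches the intuition that two mutually independent instructions can be reordered without changing the dependence structure. Brute-force enumeration is only practical for very small graphs.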

Figures (11)

  • Figure 1: Invariance violation rate across different code models (darker colors indicate more violations).
  • Figure 2: Each cluster represents the learned embeddings of a code block and its semantics-preserving permuted versions. The cluster's dispersion (variances to the mean) indicates that the permutation changes the embeddings, while the color turning from blue to red indicates the changed predictions. We take the mean of SymC's embedding so it becomes permutation-invariant (§\ref{subsec:ig_invariant_predictive_learning}).
  • Figure 3: Simplified SymC architecture, which takes as input the code block and its program dependence graph (PDG) to construct the $Aut(P\!D\!G)$-equivariant self-attention head.
  • Figure 4: The performance (F1) of SymC and baselines against different unseen code transformations defined in §\ref{sec:experimental_setup}.
  • Figure 5: Evaluation on unseen optimization and obfuscation (marked in pink). We also include the testing results on seen optimizations and obfuscations (but the testing samples are non-overlapping with the training) on the left.
  • ...and 6 more figures
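
The Figure 3 caption mentions an $Aut(P\!D\!G)$-equivariant self-attention head. The following is a minimal sketch of what that equivariance property means, assuming a generic single attention head whose scores are biased by a graph adjacency matrix. This is not claimed to be SymC's exact construction; the toy graph, the random weights, and all function names are assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def graph_biased_attention(x, wq, wk, wv, bias):
    """Single attention head with an additive, graph-derived bias on the scores."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(wq[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    n = len(x)
    weights = [
        softmax([scores[i][j] / scale + bias[i][j] for j in range(n)])
        for i in range(n)
    ]
    return matmul(weights, v)


def permute_rows(rows, perm):
    """Move row i to position perm[i]."""
    out = [None] * len(rows)
    for i, row in enumerate(rows):
        out[perm[i]] = row
    return out


def all_close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n, d = 4, 3
x = rand_matrix(n, d)  # one embedding row per instruction (hypothetical)
wq, wk, wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

# Hypothetical dependence graph: 0 -> 2, 1 -> 2, 2 -> 3.
edges = {(0, 2), (1, 2), (2, 3)}
bias = [[1.0 if (i, j) in edges else 0.0 for j in range(n)] for i in range(n)]

out = graph_biased_attention(x, wq, wk, wv, bias)

aut = (1, 0, 2, 3)      # swaps the two independent nodes: a graph automorphism
non_aut = (0, 2, 1, 3)  # swaps nodes 1 and 2: not an automorphism

out_aut = graph_biased_attention(permute_rows(x, aut), wq, wk, wv, bias)
out_non = graph_biased_attention(permute_rows(x, non_aut), wq, wk, wv, bias)

# Equivariance: permuting the input rows permutes the output rows the same way.
equivariant_for_aut = all_close(out_aut, permute_rows(out, aut))
equivariant_for_non_aut = all_close(out_non, permute_rows(out, non_aut))
```

The bias matrix is held fixed while the input rows are permuted, so the output permutes consistently only when the permutation leaves the graph unchanged. That is the sense in which the head is equivariant to graph automorphisms rather than to arbitrary reorderings.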

Theorems & Definitions (18)

  • Definition 3.1
  • Definition 3.2: $G$-equivariant code representation learning
  • Definition 3.3: $G$-invariant code predictive learning
  • Definition 3.4
  • Definition 3.5
  • Definition 3.6: $\mathcal{IG}$ Automorphism
  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • ...and 8 more