Exploiting Code Symmetries for Learning Program Semantics

Kexin Pei; Weichen Li; Qirui Jin; Shuyang Liu; Scott Geng; Lorenzo Cavallaro; Junfeng Yang; Suman Jana

TL;DR

This paper develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph, and suggests that code LLMs that encode the code structural prior via the code symmetry group generalize better and faster.

Abstract

This paper tackles the challenge of teaching code semantics to Large Language Models (LLMs) for program analysis by incorporating code symmetries into the model architecture. We introduce a group-theoretic framework that defines code symmetries as semantics-preserving transformations; organizing these transformations into a code symmetry group enables precise and efficient reasoning about code semantics. Our solution, SymC, develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph. SymC obtains superior performance on five program analysis tasks, outperforming state-of-the-art code models without any pre-training. Our results suggest that code LLMs that encode the code structural prior via the code symmetry group generalize better and faster.
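To make the equivariance claim concrete, here is a minimal NumPy sketch. It is not SymC's actual attention layer: it assumes a single head with no positional encoding and injects a toy dependence graph as an additive bias on the attention scores, which is one simple way to obtain equivariance to graph automorphisms. The function `graph_biased_attention`, the helper `perm_matrix`, and the 4-node graph are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def graph_biased_attention(X, A, Wq, Wk, Wv):
    """One attention head without positional encoding.

    Assumption for this sketch: the graph enters only as an additive
    bias A on the attention scores.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1]) + A
    return softmax(scores, axis=-1) @ V

def perm_matrix(perm):
    """Permutation matrix P such that (P @ X)[i] == X[perm[i]]."""
    P = np.zeros((len(perm), len(perm)))
    P[np.arange(len(perm)), perm] = 1.0
    return P

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))                      # one embedding per instruction
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Toy dependence graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
# Instructions 1 and 2 do not depend on each other.
A = np.zeros((n, n))
for src, dst in [(0, 1), (0, 2), (1, 3), (2, 3)]:
    A[src, dst] = 1.0

P_auto = perm_matrix([0, 2, 1, 3])               # swaps 1 and 2: a graph automorphism
P_bad = perm_matrix([1, 0, 2, 3])                # swaps 0 and 1: not an automorphism

out = graph_biased_attention(X, A, Wq, Wk, Wv)
out_auto = graph_biased_attention(P_auto @ X, A, Wq, Wk, Wv)
out_bad = graph_biased_attention(P_bad @ X, A, Wq, Wk, Wv)

# Equivariance: permuting the input by an automorphism permutes the output.
equivariant = np.allclose(out_auto, P_auto @ out)
# Mean pooling over instructions then gives a permutation-invariant summary.
invariant = np.allclose(out_auto.mean(axis=0), out.mean(axis=0))
# A permutation that breaks the dependence structure is not respected.
breaks = not np.allclose(out_bad, P_bad @ out)
```

The check works because an automorphism satisfies $PAP^\top = A$, so the biased score matrix transforms as $P S P^\top$ and the row-wise softmax commutes with that relabeling. Averaging the equivariant outputs over instructions is the same kind of mean pooling the Figure 2 caption describes for turning an equivariant representation into an invariant one.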

Paper Structure (28 sections, 9 theorems, 11 figures, 8 tables)

Introduction
Preliminaries
Method
Invariance & Equivariance for Code Models
Semantics-Preserving Program Symmetries
$Aut(\mathcal{IG})$: A Program Symmetry Group
$Aut(\mathcal{IG})$-Equivariant Code Representation
$Aut(\mathcal{IG})$-Invariant Predictor
SymC Implementation
Relaxing $\mathcal{IG}$ to Program Dependence Graph
Encoding Graph Structure
Experimental Setup
Evaluation
Invariance and Generalization
Training Efficiency
...and 13 more sections

Key Result

Theorem 1

The set of automorphisms $\sigma\in Aut(\mathcal{IG})$ forms a program symmetry group.
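The group structure asserted by Theorem 1 can be checked by brute force on a small example. The sketch below assumes a toy 4-node directed graph standing in for $\mathcal{IG}$ (the paper's actual graph construction is not reproduced here), enumerates every node permutation that preserves the edge set, and verifies identity, closure, and inverses. It checks only the graph-automorphism part; whether each automorphism preserves program semantics is the paper's claim and is not tested. The names `EDGES`, `is_automorphism`, `compose`, and `inverse` are hypothetical.

```python
from itertools import permutations

# Toy directed graph (assumption): 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
EDGES = {(0, 1), (0, 2), (1, 3), (2, 3)}
N = 4

def is_automorphism(perm, edges):
    """A permutation is an automorphism if it maps the edge set onto itself."""
    return {(perm[u], perm[v]) for u, v in edges} == edges

def compose(p, q):
    """Composition (p after q): i -> p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(p)))

def inverse(p):
    """Inverse permutation: if p maps i to p[i], the inverse maps p[i] to i."""
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return tuple(inv)

automorphisms = {p for p in permutations(range(N)) if is_automorphism(p, EDGES)}

identity = tuple(range(N))
has_identity = identity in automorphisms
is_closed = all(compose(p, q) in automorphisms
                for p in automorphisms for q in automorphisms)
has_inverses = all(inverse(p) in automorphisms for p in automorphisms)
# Associativity is inherited from function composition, so it needs no check.
```

For this graph the only non-trivial automorphism swaps nodes 1 and 2, since node 0 is the unique source and node 3 the unique sink. Brute-force enumeration costs $N!$ checks and is only meant for toy sizes.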

Figures (11)

Figure 1: Invariance violation rate across different code models (darker colors indicate more violations).
Figure 2: Each cluster represents the learned embeddings of a code block and its semantics-preserving permuted versions. The cluster's dispersion (variance around the mean) indicates that the permutation changes the embeddings, while the color turning from blue to red indicates the changed predictions. We take the mean of SymC's embedding so it becomes permutation-invariant (see the $Aut(\mathcal{IG})$-Invariant Predictor section).
Figure 3: Simplified SymC architecture, which takes as input the code block and its program dependence graph (PDG) to construct the $Aut(P\!D\!G)$-equivariant self-attention head.
Figure 4: The performance (F1) of SymC and baselines against different unseen code transformations defined in the Experimental Setup section.
Figure 5: Evaluation on unseen optimization and obfuscation (marked in pink). We also include the testing results on seen optimizations and obfuscations (but the testing samples are non-overlapping with the training) on the left.
...and 6 more figures

Theorems & Definitions (18)

Definition 3.1
Definition 3.2: $G$-equivariant code representation learning
Definition 3.3: $G$-invariant code predictive learning
Definition 3.4
Definition 3.5
Definition 3.6: $\mathcal{IG}$ Automorphism
Theorem 1
Theorem 2
Lemma 1
Lemma 2
...and 8 more
