Towards Causal Deep Learning for Vulnerability Detection

Md Mahbubur Rahman; Ira Ceka; Chengzhi Mao; Saikat Chakraborty; Baishakhi Ray; Wei Le

TL;DR

This work tackles the poor robustness and poor out-of-distribution (OOD) generalization of deep-learning vulnerability detectors by identifying and removing spurious features, i.e., lexical cues such as variable and API names that correlate with labels without causing them. It introduces CausalVul, a two-phase framework that first uncovers spurious features via semantic-preserving perturbations (PerturbVar, PerturbAPI, PerturbJoint) and then applies do-calculus-based causal learning to suppress these features during inference. By training a model to encode spurious cues in a separate channel and performing backdoor-adjusted inference, CausalVul improves in-distribution accuracy, robustness under perturbation, and cross-dataset generalization on Devign and Big-Vul across three transformer-based code models (CodeBERT, GraphCodeBERT, UniXcoder). The results indicate that causal learning can yield more reliable vulnerability detection, with the gains showing up in perturbed and OOD scenarios, offering a path toward deployable, real-world software engineering (SE) tooling. Future work will expand spurious-feature discovery and extend causal learning to additional software engineering tasks.
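As a reference point for "backdoor-adjusted inference", the standard backdoor adjustment formula is shown below. The symbols are assumptions made for this illustration and may not match the paper's own notation: X stands for the input code, Y for the vulnerability label, and M for the spurious feature (for example, the naming style) that the adjustment marginalizes out.

```latex
% Standard backdoor adjustment; X, Y, M are illustrative symbols (assumed).
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) \;=\; \sum_{m} P\bigl(Y \mid X = x,\, M = m\bigr)\, P(M = m)
```

Instead of conditioning on the single value of M that happens to co-occur with x, the prediction is averaged over the marginal distribution of M, so a correlation between M and Y cannot by itself drive the output.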

Abstract

Deep learning vulnerability detection has shown promising results in recent years. However, an important challenge that still keeps it from being useful in practice is that the models are not robust under perturbation and do not generalize well to out-of-distribution (OOD) data, e.g., when a trained model is applied to unseen projects in the real world. We hypothesize that this is because the models learn non-robust features, e.g., variable names, that have spurious correlations with the labels. When the perturbed and OOD datasets no longer contain the same spurious features, the model's predictions fail. To address this challenge, we introduce causality into deep learning vulnerability detection. Our approach, CausalVul, consists of two phases. First, we design novel perturbations to discover spurious features that the model may use to make predictions. Second, we apply causal learning algorithms, specifically do-calculus, on top of existing deep learning models to systematically remove the use of spurious features and thus promote causality-based prediction. Our results show that CausalVul consistently improves model accuracy, robustness, and OOD performance for all the state-of-the-art models and datasets we experimented with. To the best of our knowledge, this is the first work that introduces do-calculus-based causal learning to software engineering models and shows that it is indeed useful for improving model accuracy, robustness, and generalization. Our replication package is located at https://figshare.com/s/0ffda320dcb96c249ef2.
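To make the idea of a semantic-preserving perturbation concrete, the following is a minimal sketch of a variable-renaming step in the spirit of PerturbVar. It is not the authors' implementation: the function name `perturb_var_names`, the rename table, and the sample C snippet are all hypothetical, and the regex-based matching is a simplification that would also rewrite matching words inside string literals and comments.

```python
import re
from typing import Mapping


def perturb_var_names(code: str, renames: Mapping[str, str]) -> str:
    """Rename whole-word identifiers in `code` according to `renames`.

    Hypothetical helper for illustration only. Program behavior is preserved
    as long as the new names do not collide with existing identifiers.
    """
    if not renames:
        return code
    # \b anchors ensure that "len" is matched but "strlen" is left alone.
    alternatives = "|".join(re.escape(name) for name in renames)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: renames[match.group(1)], code)


# Hypothetical C snippet and rename table (assumptions, not from the paper).
original = "int len = strlen(buf); memcpy(dst, buf, len);"
perturbed = perturb_var_names(original, {"buf": "data", "len": "n"})
print(perturbed)
```

If a detector's prediction changes between `original` and `perturbed`, the renamed identifiers are candidates for spurious features, since the program's semantics did not change.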

Paper Structure (28 sections, 1 equation, 7 figures, 10 tables, 2 algorithms)

Introduction
An overview of our Approach and its Novelty
Discovering Spurious Features.
Removing Spurious Features.
Discovering Spurious Features
Problem Formulation
PerturbVar: Variable Name as a Spurious Feature
PerturbAPI: API Name as a Spurious Feature
PerturbJoint: Combine Them Together
Summary.
Causal Learning to Remove Spurious Features
Causal Graph for Vulnerability Detection
Applying Causality
Estimating Causality through Observational Data
The algorithms of Causal Vulnerability Detection
...and 13 more sections

Figures (7)

Figure 1: A vulnerable example predicted as vulnerable with probability 0.9493 but predicted as non-vulnerable with probability 0.2270 when its names are perturbed by some of the spurious names from the opposite class.
Figure 2: Visualization using Principal Component Analysis (PCA) of the code representations generated by CodeBERT for Figure 1's example, before and after perturbing the names.
Figure 3: CausalVul: an overview
Figure 4: Dead-code composed of our spurious feature, API calls
Figure 5: Causal Graph Before and After Do Calculus
...and 2 more figures
