Table of Contents
Fetching ...

CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs

Alex Mathai, Kranthi Sedamaki, Debeshee Das, Noble Saji Mathews, Srikanth Tamilselvam, Sridhar Chimalakonda, Atul Kumar

TL;DR

This work proposes CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks, and uses it to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification.

Abstract

Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source code representations that effectively capture the syntactic and semantic characteristics of code. In recent years, pre-trained transformer-based models, inspired by natural language processing (NLP), have shown remarkable success in SE tasks. However, source code contains structural and semantic properties embedded within its grammar, which can be extracted from structured code-views like the Abstract Syntax Tree (AST), Data-Flow Graph (DFG), and Control-Flow Graph (CFG). These code-views can complement NLP techniques, further improving SE tasks. Unfortunately, there are no flexible frameworks to infuse arbitrary code-views into existing transformer-based models effectively. Therefore, in this work, we propose CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks. We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification. Experimental results show that by using this technique, we improve downstream performance when compared to SLMs like GraphCodeBERT and CodeBERT on all three tasks by utilizing individual code-views or a combination of code-views during fine-tuning. We believe that these results are indicative that techniques like CodeSAM can help create compact yet performant code SLMs that fit in resource constrained settings.

CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs

TL;DR

This work proposes CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks, and uses it to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification.

Abstract

Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source code representations that effectively capture the syntactic and semantic characteristics of code. In recent years, pre-trained transformer-based models, inspired by natural language processing (NLP), have shown remarkable success in SE tasks. However, source code contains structural and semantic properties embedded within its grammar, which can be extracted from structured code-views like the Abstract Syntax Tree (AST), Data-Flow Graph (DFG), and Control-Flow Graph (CFG). These code-views can complement NLP techniques, further improving SE tasks. Unfortunately, there are no flexible frameworks to infuse arbitrary code-views into existing transformer-based models effectively. Therefore, in this work, we propose CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks. We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification. Experimental results show that by using this technique, we improve downstream performance when compared to SLMs like GraphCodeBERT and CodeBERT on all three tasks by utilizing individual code-views or a combination of code-views during fine-tuning. We believe that these results are indicative that techniques like CodeSAM can help create compact yet performant code SLMs that fit in resource constrained settings.

Paper Structure

This paper contains 34 sections, 2 equations, 10 figures, 15 tables, 1 algorithm.

Figures (10)

  • Figure 1: Example $S'_7$ where $\mathbb{T}_{X} = \mathbb{T}_{A+D}$ (Data flow in Black Arrows)
  • Figure 2: End-to-End System to infuse code-view information into Transformer models
  • Figure 3: Example of CFG+DFG without RDA (Obtained using GraphCodeBERT's Dataflow logic)
  • Figure 4: Example of CFG+DFG generated by COMEX for a snippet that demonstrates the new and improved DFG implementation using RDA
  • Figure 5: Data-flow graph illustrating last definition and last usage based data-flow edges
  • ...and 5 more figures