Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities

Wei Ma, Shangqing Liu, Mengjie Zhao, Xiaofei Xie, Wenhan Wang, Qiang Hu, Jie Zhang, Yang Liu

TL;DR

This work probes seven code models (four pre-trained models and three LLMs) to understand how they encode code syntax and semantics. It introduces four probing tasks that directly reconstruct AST, CFG, CDG, and DDG structures from model representations, along with an attention analysis that examines semantic focus. The results show that code syntax is largely well captured, especially in shallower layers, while code semantics are represented more variably and depend on architecture (encoder vs. decoder) and training regime. The findings offer guidance for designing training strategies that better integrate syntax and semantics, and for interpreting model outputs in downstream code tasks and security-sensitive applications. Overall, the study lays a foundation for improving code-model representations and informs practical use in software engineering tasks.

Abstract

Past research has examined how well code models grasp code syntax, yet their understanding of code semantics still needs to be explored. We extensively analyze seven code models to investigate how they represent code syntax and semantics. These include four prominent code pre-trained models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) and three large language models (StarCoder, CodeLlama, and CodeT5+). We have developed four probing tasks to evaluate the models' abilities to learn code syntax and semantics. These tasks focus on reconstructing code syntax and semantic structures, such as the abstract syntax tree (AST), control flow graph (CFG), control dependence graph (CDG), and data dependence graph (DDG), within the models' representation spaces. These structures are fundamental to understanding code. Additionally, we explore the role of syntax tokens in each token representation and the extended dependencies among code tokens. Furthermore, we examine the distribution of attention weights with respect to code semantic structures. Through detailed analysis, our results highlight the strengths and weaknesses of various code models in mastering code syntax and semantics. The findings reveal that these models are proficient in grasping code syntax, effectively capturing the relationships and roles of syntax tokens. However, their ability to encode code semantics shows more variability. This study enriches our understanding of the capabilities of code models in analyzing syntax and semantics. Our findings offer valuable insights for future code model enhancements, helping optimize their application across a range of code-related tasks.
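
The abstract describes the probing tasks only at a high level. As a rough illustration of what "reconstructing the AST" could look like as a supervised target, the sketch below uses Python's standard `ast` module to label every pair of syntax nodes by whether an AST edge joins them. The function name `ast_pair_labels`, the choice of Python as the probed language, and the restriction to module, statement, and expression nodes are assumptions made for this example, not details taken from the paper.

```python
import ast
from itertools import combinations

# Assumption for this sketch: keep only module, statement and expression
# nodes so the example stays small; other node kinds are skipped.
KEPT = (ast.mod, ast.stmt, ast.expr)


def ast_pair_labels(source):
    """Label each pair of kept AST nodes: 1 if a parent-child edge joins them, else 0."""
    tree = ast.parse(source)
    nodes = [node for node in ast.walk(tree) if isinstance(node, KEPT)]
    position = {id(node): i for i, node in enumerate(nodes)}

    # Collect undirected parent-child edges between kept nodes.
    edges = set()
    for parent in nodes:
        for child in ast.iter_child_nodes(parent):
            if isinstance(child, KEPT):
                edges.add(frozenset((position[id(parent)], position[id(child)])))

    names = [type(node).__name__ for node in nodes]
    labels = {
        (i, j): int(frozenset((i, j)) in edges)
        for i, j in combinations(range(len(nodes)), 2)
    }
    return names, labels


names, labels = ast_pair_labels("x = a + b")
connected = [(names[i], names[j]) for (i, j), y in labels.items() if y == 1]
```

In a probing setup of this kind, a small classifier would receive the model's representations of two nodes and be trained to predict these 0/1 labels; how accurately it does so is then read as a measure of how much syntactic structure the representations encode. How representations are extracted and aligned to nodes is outside this sketch.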

Paper Structure

This paper contains 32 sections, 7 equations, 14 figures, and 6 tables.

Figures (14)

  • Figure 1: A simple code snippet with its AST.
  • Figure 2: Euclidean distance of token representations.
  • Figure 3: Syntax Pair Node Prediction.
  • Figure 4: Analysis model.
  • Figure 5: Token Syntax Tagging.
  • ...and 9 more figures