Bringing Structure to Naturalness: On the Naturalness of ASTs

Profir-Petru Pârţachi, Mahito Sugiyama

TL;DR

This work formalizes the Structured Naturalness Hypothesis: code viewed through structured representations such as ASTs exhibits statistical regularities akin to token-level naturalness. The authors train TreeLSTMs on ASTs to predict masked tokens and quantify predictability via self-cross-entropy, finding language-dependent results and Zipfian patterns in AST labels. Tree models are competitive with $n$-gram models for some languages, such as Ruby, but perform worse for others, such as Java and Python, so structure does not uniformly improve language modeling and may instead serve as a useful side-channel for downstream tasks. The paper further shows that AST-based signals can drive near-state-of-the-art just-in-time defect prediction without manual feature engineering, motivating broader study of structured representations in both software engineering and NLP.

Abstract

Source code comes in different shapes and forms. Previous research has already shown code to be more predictable than natural language as well as highlighted its statistical predictability at the token level: source code can be natural. More recently, the structure of code -- control flow, syntax graphs, abstract syntax trees etc. -- has been successfully used to improve the state-of-the-art on numerous tasks: code suggestion, code summarisation, method naming etc. This body of work implicitly assumes that structured representations of code are similarly statistically predictable, i.e. that a structured view of code is also natural. We consider that this view should be made explicit and propose directly studying the Structured Naturalness Hypothesis. Beyond just naming existing research that assumes this hypothesis and formulating it, we also provide evidence in the case of trees: TreeLSTM models over ASTs for some languages, such as Ruby, are competitive with $n$-gram models while handling the syntax token issue highlighted by previous research 'for free'. For other languages, such as Java or Python, we find tree models to perform worse, suggesting that downstream task improvement is uncorrelated to the language modelling task. Further, we show how such naturalness signals can be employed for near state-of-the-art results on just-in-time defect prediction while forgoing manual feature engineering work.

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Log-log plot of AST labels (inner nodes) against their rank by frequency in the corpus. The plots show clear linear trends in log scale across the languages, indicative of Zipf's law (negative slopes ($\theta$) and $r^2$-values above $0.86$). Thus we expect a similar naturalness in tree contexts as was previously shown in $n$-gram models at the token level.
  • Figure 2: The first panel presents an example AST fragment. The second panel presents the TreeLSTM cell for the highlighted "Arg" node in the first panel. We annotate the paths in the second panel with the corresponding labels from Equations (1)--(7), replacing the subscripts with the corresponding node labels from the AST fragment.
  • Figure 3: Self-cross-entropy of TreeLSTM for different limits of the sequence length across languages.
  • Figure 4: Normalised log-frequency of AST node labels vs their rank in Apache Commons Lang and Math. Hue shows if the commit is defect-inducing. The two samples (defect inducing and non-defect inducing) are of equal size as we use the same method before and after the introduction of the bug.
  • Figure 5: The training process for determining if a method AST is defect-inducing or clean. We assume that the input has already been preprocessed to be ASTs. The orange lines show how a candidate AST passes through the system, i.e. is transformed by the embedding space projection before being classified by a Random Forest Classifier.
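
The Zipf's law check described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' code: it assumes the slope $\theta$ and the $r^2$ value come from an ordinary least-squares fit of log-frequency against log-rank, and the function name `zipf_fit` and the toy label counts are hypothetical.

```python
import math
from collections import Counter


def zipf_fit(labels):
    """Fit log(freq) = theta * log(rank) + c by ordinary least squares.

    Returns (theta, intercept, r_squared). Assumption: this is one common
    way to test for a Zipfian trend; the paper may use a different fit.
    """
    # Frequencies sorted in descending order; rank 1 is the most frequent label.
    counts = sorted(Counter(labels).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct labels")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    theta = sxy / sxx
    intercept = mean_y - theta * mean_x
    # r^2 of a simple linear regression; a flat line fits constant data exactly.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return theta, intercept, r_squared


# Hypothetical toy corpus: the label at rank k occurs exactly 1200 / k times.
toy_labels = []
for rank, name in enumerate(
    ["Name", "Call", "Attribute", "Arg", "Assign", "If"], start=1
):
    toy_labels.extend([name] * (1200 // rank))

theta, intercept, r_squared = zipf_fit(toy_labels)
print(f"theta={theta:.3f}, r^2={r_squared:.3f}")
```

On real AST label counts the fit would not be exact; a negative slope together with a high $r^2$ is the pattern the caption reports as indicative of Zipf's law.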