AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

Linyuan Gong, Mostafa Elhoushi, Alvin Cheung

TL;DR

Code-language models often treat code as unstructured sequences, neglecting syntax; AST-T5 tackles this by introducing AST-aware pretraining using Tree-sitter-parsed ASTs, DP-based segmentation, and AST-aware subtree masking within a standard encoder-decoder T5 framework. The authors demonstrate that these structure-aware pretraining cues improve code generation, transpilation, and understanding, outperforming similar-sized baselines and approaching larger models on several benchmarks. Importantly, AST-T5 remains architecture-agnostic and acts as a drop-in replacement for existing encoder-decoder LMs, with strong gains in code-to-code tasks such as Bugs2Fix and Java-C# transpilation, and robust performance on HumanEval/MBPP. The work suggests that targeted structural priors can yield substantial benefits for code-centric AI systems and opens paths for scaling and broader language coverage.

Abstract

Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score for the Bugs2Fix task and by 3 points in exact match score for Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.
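
As a rough illustration of the idea behind AST-Aware Span Corruption, the sketch below masks whole AST subtrees of a small factorial function and builds the matching target sequence. It rests on several assumptions that are not from the paper: it uses Python's built-in ast module instead of Tree-sitter, works on characters rather than BPE tokens, picks the masked subtrees by node type rather than by sampling, and uses [X] and [Y] as stand-in sentinel strings. The helper names (line_offsets, subtree_spans, mask_spans) are hypothetical.

```python
import ast

SOURCE = '''def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)
'''

SENTINELS = ["[X]", "[Y]", "[Z]"]  # placeholder sentinel strings (assumption)


def line_offsets(source):
    """Character offset at which each line starts (ast line numbers are 1-based)."""
    offsets, total = [], 0
    for line in source.splitlines(keepends=True):
        offsets.append(total)
        total += len(line)
    return offsets


def subtree_spans(source, node_types):
    """Sorted (start, end) character spans of AST nodes of the given types.

    Assumes ASCII source, so ast's byte-based column offsets equal character offsets.
    """
    offsets = line_offsets(source)
    spans = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, node_types):
            start = offsets[node.lineno - 1] + node.col_offset
            end = offsets[node.end_lineno - 1] + node.end_col_offset
            spans.append((start, end))
    return sorted(spans)


def mask_spans(source, spans):
    """Replace each non-overlapping span with a sentinel; return (input, target)."""
    pieces, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(SENTINELS, spans):
        pieces.append(source[cursor:start])
        pieces.append(sentinel)
        target.append(sentinel + source[start:end])
        cursor = end
    pieces.append(source[cursor:])
    return "".join(pieces), "".join(target)


# Mask the comparison and the recursive call; these two subtrees do not overlap.
spans = subtree_spans(SOURCE, (ast.Compare, ast.Call))
corrupted, target = mask_spans(SOURCE, spans)
print(corrupted)
print(target)
```

Each masked span here is a complete expression (`n == 0` and `factorial(n - 1)`), so the target never starts or ends in the middle of a syntactic unit, which is the contrast with random span masking that the abstract draws.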

Paper Structure

This paper contains 33 sections, 3 figures, 7 tables, and 3 algorithms.

Figures (3)

  • Figure 1: Comparison of AST-Aware Subtree Corruption and Vanilla T5 using a Python factorial function. Both methods replace masked spans with sentinel tokens (special tokens added to the vocabulary, shown as [X], [Y], and [Z] in the figure), with output sequences containing the original masked tokens. Inputs and targets are shown in byte-pair encoding (BPE); for instance, "factorial" is encoded into "fact" and "orial". Unlike Vanilla T5, which masks random spans without considering code structure, our approach specifically targets spans aligned with AST subtrees, like expressions and statements.
  • Figure 2: Comparison between Greedy Segmentation and AST-Aware Segmentation: For a 112-token code example with max_len set at 48, Greedy Segmentation places the first 48 tokens in Block 1, the next 48 tokens in Block 2, and the remaining in Block 3, disrupting the structural integrity of the code. In contrast, AST-Aware Segmentation uses a dynamic programming algorithm to smartly partition the code, aligning with boundaries of member functions or major function branches, thereby preserving the code's structure. The accompanying AST, with some levels pruned for clarity, corroborates that these segmentations indeed coincide with key subtree demarcations.
  • Figure 3: Visualizations of AST-T5's performance on HumanEval and MBPP compared to models exceeding 300M parameters. Each point on each scatter plot represents a model. The x-axis shows the parameter count in log-scale, while the y-axis shows the Pass@1 rate on HumanEval or MBPP in log-scale. Model open-source status is color-coded: blue for open-source and red for proprietary.
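
The contrast described in Figure 2 can be sketched in a few lines. The greedy splitter below reproduces the 48/48/16 split of a 112-token example with max_len 48, and a small dynamic program picks cut points that minimize a per-position penalty instead. This is a simplified sketch, not the paper's algorithm: the cut_cost array, the penalty value 3, and the zero-cost boundaries at positions 40 and 75 are invented for illustration, and the sketch does not constrain the number of segments.

```python
def greedy_segments(n_tokens, max_len):
    """Cut every max_len tokens, ignoring code structure."""
    return [(s, min(s + max_len, n_tokens)) for s in range(0, n_tokens, max_len)]


def structure_aware_segments(cut_cost, n_tokens, max_len):
    """Split [0, n_tokens) into segments of at most max_len tokens,
    minimizing the summed penalty of the chosen cut positions.

    cut_cost[i] is a hypothetical penalty for cutting right before token i,
    e.g. the number of AST subtrees that such a cut would break.
    """
    inf = float("inf")
    best = [inf] * (n_tokens + 1)   # best[i]: minimal penalty to segment the first i tokens
    prev = [-1] * (n_tokens + 1)    # prev[i]: start of the last segment in that optimum
    best[0] = 0
    for end in range(1, n_tokens + 1):
        for start in range(max(0, end - max_len), end):
            penalty = cut_cost[start] if start > 0 else 0  # position 0 is not a cut
            total = best[start] + penalty
            if total < best[end]:
                best[end] = total
                prev[end] = start
    # Walk back through prev to recover the segments.
    segments, end = [], n_tokens
    while end > 0:
        segments.append((prev[end], end))
        end = prev[end]
    return segments[::-1], best[n_tokens]


N_TOKENS, MAX_LEN = 112, 48

# Illustrative penalties: most cuts break 3 subtrees, but cutting at
# positions 40 and 75 (assumed function boundaries) breaks none.
cut_cost = [3] * N_TOKENS
cut_cost[40] = 0
cut_cost[75] = 0

greedy = greedy_segments(N_TOKENS, MAX_LEN)
aware, aware_cost = structure_aware_segments(cut_cost, N_TOKENS, MAX_LEN)
greedy_cost = sum(cut_cost[start] for start, _ in greedy if start > 0)

print(greedy, greedy_cost)
print(aware, aware_cost)
```

Under these made-up penalties, the greedy split cuts at positions 48 and 96 and pays for both, while the dynamic program finds the two penalty-free boundaries and still keeps every block within max_len.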