Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models

Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen

TL;DR

This work addresses a limitation of code language models that treat source code as plain text: they ignore the explicit structure of programs. It presents a plug-and-play approach that serializes concrete syntax trees (CSTs) and uses them for continual pre-training and fine-tuning of existing pre-trained models, with no architectural changes. For encoder-decoder models it introduces masked subtree prediction (MSP), masked node prediction (MNP), and text-to-tree/tree-to-text (TeTr/TrTe) objectives; for decoder-only models it applies causal language modeling to serialized CSTs. Across code translation, generation, and summarization, the method improves over the plain-text baseline, with the largest gains when training examples are limited. The results suggest practical value for low-resource languages and domains, including fewer erroneous translations and improved semantic correctness.

Abstract

Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures inherent in programming languages. In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures. Specifically, we represent programs as parse trees -- also known as concrete syntax trees (CSTs) -- and adapt pre-trained models on serialized CSTs. Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs without changing the model architecture yields improvements over the baseline approach across various code tasks. The improvements are found to be particularly significant when there are limited training examples, demonstrating the effectiveness of integrating program structures with plain-text representation even when working with backbone models that have not been pre-trained with structures.

Paper Structure

This paper contains 38 sections, 4 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: An example Python program along with its CST (simplified for illustration) in the tree and serialized forms, respectively. Also illustrated are the masked subtree prediction and masked node prediction training objectives for adapting pre-trained models to code structures.
  • Figure 2: Code translation performance. Left two: Java $\leftrightarrow$ C# (CodeXGLUE); right two: Java $\leftrightarrow$ Python (TransCoder). For full results on all evaluation metrics, see the Appendix.
  • Figure 3: Code generation performance. From left to right: CoNaLa, Concode, MBPP. For full results on all evaluation metrics, see the Appendix.
  • Figure 4: Average code summarization performance. For per-language results, see the Appendix.
  • Figure 5: Model generations for a test sample in the MBPP dataset.
  • ...and 10 more figures