Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models
Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen
TL;DR
This work addresses the data inefficiency of code language models that treat source code as plain text by exploiting explicit program structure. It presents a plug-and-play approach that serializes concrete syntax trees (CSTs) and uses them for continued pre-training and fine-tuning of existing pre-trained models, with no architectural changes. For encoder-decoder models, it introduces the MSP, MNP, and TeTr/TrTe training objectives; for decoder-only models, it applies causal language modeling to serialized CSTs. Across code translation, generation, and summarization tasks, the method yields substantial gains in low-data scenarios, showing that combining structure with text improves data efficiency and robustness while preserving scalability. The results are most relevant to low-resource languages and domains, where the method significantly reduces erroneous translations and improves semantic correctness.
Abstract
Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures inherent in programming languages. In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures. Specifically, we represent programs as parse trees -- also known as concrete syntax trees (CSTs) -- and adapt pre-trained models on serialized CSTs. Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs without changing the model architecture yields improvements over the baseline approach across various code tasks. The improvements are found to be particularly significant when there are limited training examples, demonstrating the effectiveness of integrating program structures with plain-text representation even when working with backbone models that have not been pre-trained with structures.
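To make "serialized CSTs" concrete, the following is a minimal sketch of how a parse tree might be flattened into a token sequence that a text-only model can consume. The `Node` class, the bracket-style `<kind> ... </kind>` format, and the hand-built tree are all illustrative assumptions; the abstract does not specify the serialization format the authors use, and a real pipeline would obtain the tree from a parser rather than construct it by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A hypothetical CST node: leaves carry source text, inner nodes carry children."""
    kind: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def serialize(node: Node) -> str:
    """Flatten the tree depth-first, wrapping each inner node in open/close tags."""
    if node.text is not None:  # leaf: emit the source token itself
        return node.text
    inner = " ".join(serialize(child) for child in node.children)
    return f"<{node.kind}> {inner} </{node.kind}>"


def surface(node: Node) -> str:
    """Recover the plain-text token sequence by reading only the leaves."""
    if node.text is not None:
        return node.text
    return " ".join(surface(child) for child in node.children)


# Hand-built tree for the statement `x = 1 + 2` (node kinds are assumed names).
tree = Node("assignment", children=[
    Node("identifier", "x"),
    Node("=", "="),
    Node("binary_operator", children=[
        Node("integer", "1"),
        Node("+", "+"),
        Node("integer", "2"),
    ]),
])

print(serialize(tree))
print(surface(tree))
```

Because a concrete syntax tree keeps every source token as a leaf, the plain-text program remains recoverable from the serialized form, while the added tags expose the nesting that the surface form leaves implicit.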
