UniGenCoder: Merging Seq2Seq and Seq2Tree Paradigms for Unified Code Generation

Liangying Shao, Yanfu Yan, Denys Poshyvanyk, Jinsong Su

TL;DR

UniGenCoder addresses the split between Seq2Seq and Seq2Tree code generation with a single Transformer-based model: a shared encoder, a shared decoder extended with a minimal set of additional parameters, and a selector that chooses the paradigm for each instance. Training first combines multi-task learning and distillation to transfer knowledge between the two paradigms, then uses contrastive learning to train the selector, with CodeT5 as the backbone. Results on text-to-code and code-to-code generation tasks are reported to show consistent gains over strong Seq2Seq baselines and competitive performance against Seq2Tree baselines, while remaining efficient. The design is not tied to a specific backbone and may extend to larger open-source LLMs.

Abstract

Deep learning-based code generation has transformed the way developers write programs. Existing approaches have focused on either the Sequence-to-Sequence (Seq2Seq) paradigm, which generates target code as a sequence of tokens, or the Sequence-to-Tree (Seq2Tree) paradigm, which outputs code as a sequence of actions. Although these two paradigms are intuitively complementary, their combination has not been previously explored. By comparing the code generated under the two paradigms, we find that integrating them holds significant potential. In this paper, we propose UniGenCoder for code-related generation tasks. It consists of a shared encoder, a shared decoder with a minimal set of additional parameters to unify the two paradigms, and a selector that dynamically chooses the optimal paradigm for each instance. During training, we first apply multi-task learning and distillation strategies to facilitate knowledge transfer between the two paradigms, and then leverage contrastive learning to train the selector. Experimental results on text-to-code and code-to-code generation tasks demonstrate the effectiveness of our proposed model. We release our code at https://github.com/DeepLearnXMU/UniGenCoder.
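
To make the difference between the two target formats concrete, the following minimal sketch renders the same snippet once as a token sequence and once as an action sequence. It is an illustration only: it uses Python's standard `tokenize` and `ast` modules, and the action names `ApplyRule` and `GenToken` as well as the helper names `to_tokens` and `to_actions` are assumptions chosen for this example, not the grammar or action set used by UniGenCoder.

```python
import ast
import io
import tokenize
from typing import List

# Token types that carry layout information only and are dropped here.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
    tokenize.INDENT, tokenize.DEDENT,
}


def to_tokens(source: str) -> List[str]:
    """Seq2Seq-style target: the code as a flat list of lexical tokens."""
    stream = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in stream if tok.type not in _LAYOUT]


def to_actions(source: str) -> List[str]:
    """Seq2Tree-style target: a pre-order walk of the AST as actions.

    Hypothetical action scheme for illustration: each AST node becomes
    an ApplyRule action, each primitive field value a GenToken action.
    """
    actions: List[str] = []

    def visit(node: ast.AST) -> None:
        actions.append(f"ApplyRule[{type(node).__name__}]")
        for _, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    visit(child)
                elif child is not None:
                    actions.append(f"GenToken[{child!r}]")

    visit(ast.parse(source))
    return actions
```

Calling `to_tokens("x = 1\n")` yields the three tokens `x`, `=`, and `1`, while `to_actions` on the same input yields a longer sequence that interleaves grammar-rule actions with token-generating actions. The action sequence is longer but always corresponds to a well-formed tree, which is one intuition for why the two paradigms may complement each other.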

Paper Structure

This paper contains 15 sections, 4 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: The overall architecture of our UniGenCoder model. The token embedding matrix and the linear layer for token prediction are designed for the Seq2Seq paradigm, while the action embedding matrix and the linear layer for action prediction are tailored to the Seq2Tree paradigm. Note that the token embedding matrix is included in the action embedding matrix, as output actions in the Seq2Tree paradigm can be either tokens or rules. Likewise, the linear layer for token prediction is contained in the linear layer for action prediction.
  • Figure 2: Our proposed distillation strategy. $\theta, \theta_{s2s}, \theta_{s2t}$ represent the parameters of UniGenCoder, CodeT5(Seq2Seq) and CodeT5(Seq2Tree), respectively.
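
The containment described in the Figure 1 caption can be sketched as follows. This is a toy, dependency-free illustration under stated assumptions, not the released implementation: the class name `SharedVocabulary`, the tokens-first row ordering, and the random initialization are all hypothetical. The point it shows is that the Seq2Seq paradigm uses only the token rows of matrices that the Seq2Tree paradigm uses in full, so the rule rows are the only extra parameters.

```python
import random
from typing import List, Sequence


class SharedVocabulary:
    """Toy action vocabulary whose first rows are the token vocabulary."""

    def __init__(self, tokens: Sequence[str], rules: Sequence[str],
                 dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.tokens = list(tokens)
        # Assumption: tokens come first, grammar rules are appended after.
        self.actions = self.tokens + list(rules)
        # One embedding row and one output row per action.
        self.embedding = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                          for _ in self.actions]
        self.output = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in self.actions]

    def embed(self, symbol: str) -> List[float]:
        """Look up the embedding row for a token or a rule."""
        return self.embedding[self.actions.index(symbol)]

    def logits(self, hidden: Sequence[float], paradigm: str) -> List[float]:
        """Score the decoder state against the rows the paradigm may emit."""
        if paradigm == "seq2seq":
            rows = self.output[:len(self.tokens)]  # token rows only
        elif paradigm == "seq2tree":
            rows = self.output                     # token rows plus rule rows
        else:
            raise ValueError(f"unknown paradigm: {paradigm}")
        return [sum(w * h for w, h in zip(row, hidden)) for row in rows]
```

With three tokens and two rules, `logits(hidden, "seq2seq")` returns three scores and `logits(hidden, "seq2tree")` returns five, and the first three of the latter equal the former because both read the same rows.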