TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills

Qiushi Sun, Nuo Chen, Jianing Wang, Xiang Li, Ming Gao

TL;DR

TransCoder introduces a unified transferable framework for code representation learning that captures cross-task and cross-language knowledge through a tunable universal knowledge prefix. The method operates in two stages: source task training to absorb universal knowledge via continual learning, and target task specification to apply the learned prefix to new CodePTMs. Empirical results on CodeXGLUE show consistent improvements across code generation and understanding tasks, with pronounced gains in low-resource languages and data-scarce tasks, and ablation analyses confirm the effectiveness of the universal knowledge prefix. This approach reduces data imbalance effects and enables mutual reinforcement among tasks and languages, offering practical benefits for real-world code intelligence under limited data scenarios.

Abstract

Code pre-trained models (CodePTMs) have recently demonstrated a solid capacity to process various software intelligence tasks, e.g., code clone detection, code translation, and code summarization. The current mainstream method for deploying these models to downstream tasks is to fine-tune them on individual tasks, which is generally costly and requires sufficient data for large models. To tackle this issue, we present TransCoder, a unified Transferable fine-tuning strategy for Code representation learning. Inspired by the inherent human skill of knowledge generalization, TransCoder drives the model to learn better code-related meta-knowledge, as human programmers do. Specifically, we employ a tunable prefix encoder as the meta-learner to capture cross-task and cross-language transferable knowledge. In addition, tasks with small training sets and languages with small corpora benefit remarkably from our approach. Extensive experiments on benchmark datasets demonstrate that our method leads to superior performance on various code-related tasks and encourages mutual reinforcement. We also show that TransCoder is applicable in low-resource scenarios. Our code is available at https://github.com/QiushiSun/TransCoder.

Paper Structure (34 sections, 2 equations, 5 figures, 11 tables, 1 algorithm)

Figures (5)

  • Figure 1: (a) A CodePTM (e.g., CodeT5, PLBART) learns through a series of downstream code tasks, such as code summarization and clone detection, in order to acquire cross-task knowledge of code representation. (b) In the currently available code corpora (both bimodal and unimodal data), there is an imbalance between different PLs. Nonetheless, different languages share similar programming principles, so they can "support" each other when the model learns cross-language knowledge. (Best viewed in color.)
  • Figure 2: An illustration of the architecture of TransCoder. (1) In the source task training stage, tunable universal knowledge prefixes are first randomly initialized and prepended to a CodePTM (e.g., CodeT5, PLBART). The whole model is tuned by back-propagation. (2) In the target task specification stage, we prepend these universal knowledge prefixes to a new CodePTM, effectively infusing universal knowledge into the model. For brevity, this figure shows a cross-task scenario with some representative tasks: the knowledge acquired from code summarization/defect detection is used to enhance the performance of code translation/clone detection. The order of tasks or languages can be rearranged flexibly. (Best viewed in color.)
  • Figure 3: Comparison between fine-tuning and TransCoder (with code summarization knowledge) on the code defect detection task, using the PLBART backbone.
  • Figure 4: Comparison between fine-tuning and TransCoder (with code understanding knowledge) on summarizing Java code, using the CodeT5 backbone.
  • Figure 5: Results of employing different source task training orders; refer to the task-order table (label: tab:task-order) for details.
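
The two-stage procedure described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is only a linear head over mean-pooled vectors standing in for a CodePTM, and every name (DIM, PREFIX_LEN, train_step, source_task_training, target_task_specification, make_task) is hypothetical. Letting the prefix keep updating during the target stage is also an assumption of this sketch. It shows only the mechanics: a randomly initialized prefix is tuned together with one model across source tasks, then carried over and prepended to a fresh model for a target task.

```python
import random

DIM = 4         # toy embedding width (hypothetical)
PREFIX_LEN = 2  # number of tunable prefix vectors (hypothetical)


def init_vectors(count, rng):
    """Randomly initialise `count` small vectors of width DIM."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(count)]


def forward(prefix, weights, tokens):
    """Prepend the prefix to the token vectors, mean-pool, apply a linear head."""
    sequence = prefix + tokens
    n = len(sequence)
    pooled = [sum(vec[j] for vec in sequence) / n for j in range(DIM)]
    pred = sum(w * p for w, p in zip(weights, pooled))
    return pred, pooled, n


def train_step(prefix, weights, tokens, target, lr=0.1):
    """One SGD step on squared error; updates prefix and weights in place.

    Returns the loss measured before the update.
    """
    pred, pooled, n = forward(prefix, weights, tokens)
    err = 2.0 * (pred - target)
    grad_w = [err * p for p in pooled]
    grad_prefix = [err * w / n for w in weights]  # identical for every prefix vector
    for j in range(DIM):
        weights[j] -= lr * grad_w[j]
    for vec in prefix:
        for j in range(DIM):
            vec[j] -= lr * grad_prefix[j]
    return (pred - target) ** 2


def source_task_training(source_tasks, rng, epochs=20):
    """Stage 1: tune a random prefix jointly with a toy model, task after task."""
    prefix = init_vectors(PREFIX_LEN, rng)
    weights = init_vectors(1, rng)[0]
    for task in source_tasks:  # tasks are visited sequentially
        for _ in range(epochs):
            for tokens, target in task:
                train_step(prefix, weights, tokens, target)
    return prefix


def target_task_specification(prefix, target_task, rng, epochs=20):
    """Stage 2: prepend the learned prefix to a fresh toy model and tune it."""
    weights = init_vectors(1, rng)[0]    # stands in for a new CodePTM
    prefix = [vec[:] for vec in prefix]  # copy, so the caller's prefix is untouched
    losses = []
    for _ in range(epochs):
        total = sum(train_step(prefix, weights, tokens, target)
                    for tokens, target in target_task)
        losses.append(total / len(target_task))
    return prefix, weights, losses


def make_task(rng, num_examples):
    """Toy regression task: random token vectors paired with a random target."""
    return [([[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)],
             rng.uniform(-1, 1)) for _ in range(num_examples)]


if __name__ == "__main__":
    rng = random.Random(0)
    sources = [make_task(rng, 8), make_task(rng, 8)]
    learned_prefix = source_task_training(sources, rng)
    _, _, losses = target_task_specification(learned_prefix, make_task(rng, 4), rng)
    print("target-task loss (first epoch, last epoch):", losses[0], losses[-1])
```

Because the source tasks are consumed in a plain loop, reordering the `sources` list is all it takes to try a different task order in this sketch, which loosely mirrors the caption's remark that the order of tasks or languages can be rearranged.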