CoTexT: Multi-task Learning with Code-Text Transformer

Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, Yanfang Ye

TL;DR

CoTexT introduces a T5-based encoder-decoder pre-trained on bimodal NL-code data and unimodal code data to learn robust NL-PL representations across multiple programming languages. By framing NL-PL tasks as text-to-text problems and using task-specific prefixes, it achieves state-of-the-art results on CodeXGLUE tasks, including Code Summarization, Code Generation, Code Refinement, and Defect Detection, outperforming prior models such as CodeBERT and PLBART. The work demonstrates the versatility of large-scale, cross-modal pre-training for diverse code intelligence tasks and provides public checkpoints for future research. Overall, CoTexT advances NL-PL understanding and generation, with implications for real-world code understanding, documentation, and repair.
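
To make the text-to-text framing concrete, here is a minimal sketch of how one task-specific prefix per task can turn different NL-PL tasks into plain (source, target) string pairs for a single encoder-decoder model. The prefix strings, the Example type, and the helper names are hypothetical and chosen only for illustration; the source does not list the exact prefixes CoTexT uses.

```python
from dataclasses import dataclass

# Hypothetical prefixes, for illustration only. The exact strings used by
# CoTexT are not given in the text above.
TASK_PREFIXES = {
    "summarize_python": "summarize python: ",
    "refine_small": "refine small: ",
    "defect_detection": "defect detection: ",
}


@dataclass(frozen=True)
class Example:
    """One text-to-text training pair."""
    source: str  # prefixed model input
    target: str  # expected output text


def to_text_to_text(task: str, model_input: str, expected_output: str) -> Example:
    """Prepend the task prefix so one model can tell the tasks apart."""
    return Example(source=TASK_PREFIXES[task] + model_input, target=expected_output)


def build_multitask_batch(records):
    """Mix (task, input, output) records from several tasks into one batch."""
    return [to_text_to_text(task, x, y) for task, x, y in records]


batch = build_multitask_batch([
    ("summarize_python", "def add(a, b): return a + b", "Add two numbers."),
    ("defect_detection", "int f(char *p) { return p[0]; }", "true"),
])
for example in batch:
    print(example.source, "->", example.target)
```

Under this framing, a classification task such as defect detection is also expressed as text: the label is simply the target string.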

Abstract

We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarization/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of available PL corpora, including both "bimodal" and "unimodal" data. Here, bimodal data is the combination of text and corresponding code snippets, whereas unimodal data consists of code snippets only. We first evaluate CoTexT with multi-task learning: we perform Code Summarization on 6 different programming languages and Code Refinement on both the small and medium-sized subsets featured in the CodeXGLUE dataset. We further conduct extensive experiments to investigate CoTexT on other tasks within the CodeXGLUE dataset, including Code Generation and Defect Detection. We consistently achieve SOTA results in these tasks, demonstrating the versatility of our models.

Paper Structure

This paper contains 28 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: An illustration of the fill-in-the-blank objective
  • Figure 2: An illustration of multi-task learning
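
Since the figures themselves are not reproduced here, the following is a rough sketch of what a fill-in-the-blank objective can look like, assuming it resembles T5-style span masking: chosen spans in the input are replaced by sentinel tokens, and the target lists each sentinel followed by the tokens it hides. The sentinel naming (<extra_id_N>) follows the convention of the Hugging Face T5 tokenizer and is an assumption, as are the function name and the hand-picked spans; real pre-training would sample spans at random.

```python
def mask_spans(tokens, spans):
    """Build a fill-in-the-blank (input, target) pair.

    tokens: list of string tokens.
    spans:  sorted, non-overlapping (start, end) half-open index pairs
            marking the token ranges to blank out.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # blank out the span in the input
        targets.append(sentinel)             # the target names the blank...
        targets.extend(tokens[start:end])    # ...then spells out its content
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


code_tokens = "def add ( a , b ) : return a + b".split()
masked_input, target = mask_spans(code_tokens, [(1, 2), (8, 11)])
print(masked_input)
print(target)
```

The model sees the masked input and is trained to produce the target, so it has to reconstruct the hidden tokens from the surrounding context.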