A Comparative Study on Code Generation with Transformers

Namrata Das, Rakshya Panta, Neelam Karki, Ruchi Manandhar, Dinesh Baniya Kshatri

TL;DR

The paper tackles automated code generation by translating complete program pseudocode into C++ using Transformer architectures of varying complexity. It compares a base Transformer and a CodeT5-small pretrained model, fine-tuned on the SPoC dataset, to assess accuracy and resource efficiency. The study finds CodeT5-small substantially more robust on multi-step problems, achieving higher CodeBLEU and related metrics than the base model, albeit at greater computational cost. The results underscore the importance of dataset size and pretraining in enabling reliable, executable code generation from pseudocode for practical deployment.

Abstract

As Natural Language Processing (NLP) has grown in influence, multiple research efforts have sought to supplant traditional manual coding with automated systems capable of generating solutions autonomously. Because research on code generation is advancing rapidly and focuses almost exclusively on large language models, there is a need to compare and evaluate Transformer architectures of differing model complexity. This paper presents "A Comparative Study on Code Generation with Transformers," which uses Transformer-based models and NLP methodologies to automatically generate C++ source code for a variety of problems. The comparative study evaluates the robustness of Transformer-based models with respect to their architectural complexity and their ability to handle diverse problem sets, from basic arithmetic to complex computations.

Paper Structure

This paper contains 9 sections, 7 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Training Dataset Distribution
  • Figure 2: SPoC Dataset Sample
  • Figure 3: Modified Data Sample
  • Figure 4: System Block Diagram
  • Figure 5: Positional Encoding for Dimension 512 and Sequence Length 2048
  • ...and 4 more figures