Enhancing LLM Code Generation Capabilities through Test-Driven Development and Code Interpreter

Sajed Jalil, Shuvo Saha, Hossain Mohammad Seym

TL;DR

The paper addresses Bengali code generation, where language resources are scarce, by pairing open-weight LLMs with a combined Test-Driven Development (TDD) and Code Interpreter (CI) workflow, without any fine-tuning. It reports that CI+TDD yields substantial accuracy improvements (up to +450%) and nearly eliminates compilation errors, and that smaller models approach the performance of larger ones in the same family (up to 98% of the largest model's accuracy). Translating Bengali prompts to English offers little benefit and sometimes degrades results. The approach therefore makes high-quality Bengali code generation accessible in resource-constrained settings without fine-tuning or external data augmentation. The work emphasizes practical impact for multilingual code generation and suggests applicability to other underrepresented languages; its results are published on GitHub for reproducibility.

Abstract

Over the past few years, improving LLM code generation capabilities has been a key focus in NLP research. Despite Bengali having 242 million native speakers worldwide, it receives little attention when it comes to training LLMs. More recently, various fine-tuning and augmented generation techniques have been employed to significantly enhance code generation performance. However, they require considerable expertise and resources for an end user to apply effectively. The goal of our work is to democratize access to powerful code generation tools in resource-constrained emerging markets, enabling users to leverage them in their native language. We introduce a novel approach that combines Test-Driven Development (TDD) and Code Interpreter (CI), utilizing open-weight models, which improves the baseline accuracy for code generation with Bengali prompts and achieves an overall accuracy of 85%. Our approach requires no fine-tuning and shows that even the smallest models in a family can attain up to 98% of the accuracy of the largest models in that family. All of our results are publicly shared on GitHub for validation and reproducibility.
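To make the combined workflow concrete, the following is a minimal sketch of a test-driven loop with execution feedback. It is not the authors' implementation: the function names (`run_tests`, `solve`, `make_stub_model`), the prompt wording, the retry limit, and the stub that stands in for an LLM are all assumptions for illustration, and the built-in `exec` is used in place of an isolated code interpreter.

```python
import traceback
from typing import Callable, List, Optional, Tuple


def run_tests(code: str, tests: List[str]) -> Tuple[bool, str]:
    """Execute candidate code, then each assert-style test, in one fresh namespace.

    Returns (all_passed, feedback). exec() is NOT a sandbox; it only stands in
    for an isolated code interpreter in this sketch.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception:
        return False, "Code failed to run:\n" + traceback.format_exc(limit=1)
    for test in tests:
        try:
            exec(test, namespace)
        except Exception:
            return False, f"Test failed: {test}\n" + traceback.format_exc(limit=1)
    return True, ""


def solve(
    generate: Callable[[str], str],
    instruction: str,
    tests: List[str],
    max_attempts: int = 3,  # assumed limit, not taken from the paper
) -> Tuple[Optional[str], int]:
    """Ask the model for code, run the tests, and feed any failure back."""
    # TDD part: the tests are shown to the model together with the instruction.
    prompt = instruction + "\n\nThe code must pass these tests:\n" + "\n".join(tests)
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt)
        # CI part: the candidate is actually executed against the tests.
        passed, feedback = run_tests(code, tests)
        if passed:
            return code, attempt
        prompt += "\n\nYour previous attempt failed:\n" + feedback + "\nPlease fix it."
    return None, max_attempts


def make_stub_model() -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: the first answer is buggy, the second correct."""
    answers = iter([
        "def add(a, b):\n    return a - b\n",
        "def add(a, b):\n    return a + b\n",
    ])
    return lambda prompt: next(answers)


if __name__ == "__main__":
    tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
    # Illustrative Bengali instruction: "Write a function that adds two numbers."
    instruction = "দুটি সংখ্যা যোগ করার একটি ফাংশন লিখুন।"
    code, attempts = solve(make_stub_model(), instruction, tests)
    print(f"passed after {attempts} attempt(s)")
```

In a real setting, `generate` would wrap a call to an open-weight model, and the candidate code would run in an isolated process or container instead of through `exec`.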

Paper Structure

This paper contains 11 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Example of dataset rows used in our study (English instruction is added here for readers' convenience.)
  • Figure 2: Two variants of Bengali to English machine translation.
  • Figure 3: Variants of Test-Driven Development (TDD) approaches in our experiments.
  • Figure 4: Code Interpreter with Test-Driven Development (TDD) approach.
  • Figure 5: Overall accuracy heatmap of models in different approaches.
  • ...and 2 more figures