Multi-Programming Language Ensemble for Code Generation in Large Language Model
Tengfei Xue, Xuefeng Li, Tahir Azim, Roman Smirnov, Jianhui Yu, Arash Sadrieh, Babak Pahlavan
TL;DR
The paper addresses the limitation of single-language focus in LLM-based code generation by introducing Multi-Programming Language Ensemble (MPLE). MPLE treats code produced in different programming languages as complementary weak experts and iteratively refines solutions through cross-language transformations, translation back to a primary language, and ensemble integration. It further integrates with established inference techniques such as the reflection algorithm and Monte Carlo Tree Search to enhance robustness and to search over candidate code paths. Empirical results on HumanEval and HumanEval-plus show consistent improvements across a range of models, including 96.25% Pass@1 on HumanEval for certain LLMs. These results indicate state-of-the-art performance and strong generalization potential for multi-language code generation in practical settings.
Abstract
Large language models (LLMs) have significantly improved code generation, particularly in one-pass code generation. However, most existing approaches focus solely on generating code in a single programming language, overlooking the potential of leveraging the multi-language capabilities of LLMs. LLMs have varying patterns of errors across different languages, suggesting that a more robust approach could be developed by leveraging these multi-language outputs. In this study, we propose Multi-Programming Language Ensemble (MPLE), a novel ensemble-based method that utilizes code generation across multiple programming languages to enhance overall performance. By treating each language-specific code generation process as an individual "weak expert" and effectively integrating their outputs, our method mitigates language-specific errors and biases. This multi-language ensemble strategy leverages the complementary strengths of different programming languages, enabling the model to produce more accurate and robust code. Our approach can be seamlessly integrated with commonly used techniques such as the reflection algorithm and Monte Carlo tree search to further improve code generation quality. Experimental results show that our framework consistently enhances baseline performance by up to 17.92% on existing benchmarks (HumanEval and HumanEval-plus), with a standout result of 96.25% accuracy on the HumanEval benchmark, achieving new state-of-the-art results across various LLMs. The code will be released at https://github.com/NinjaTech-AI/MPLE
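The ensemble idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `mple` and the callables `generate`, `translate`, and `passes_tests` are hypothetical names standing in for LLM calls and a test harness, and the toy stubs hard-code their outputs so the example runs without a model. The sketch also assumes that each alternate-language attempt is generated fresh; the actual method may condition on the earlier failed attempt.

```python
from typing import Callable, Optional, Sequence


def mple(
    task: str,
    primary: str,
    alternates: Sequence[str],
    generate: Callable[[str, str], str],
    translate: Callable[[str, str, str], str],
    passes_tests: Callable[[str], bool],
) -> Optional[str]:
    """Return primary-language code that passes the tests, or None.

    All callables are hypothetical placeholders:
      generate(task, lang)       -> code written in `lang`
      translate(code, src, dst)  -> `code` rewritten from `src` into `dst`
      passes_tests(code)         -> True if primary-language code passes
    """
    # First attempt: generate directly in the primary language.
    candidate = generate(task, primary)
    if passes_tests(candidate):
        return candidate

    # Fall back to each alternate language (a "weak expert"),
    # then translate its output back to the primary language.
    for lang in alternates:
        alt_code = generate(task, lang)
        candidate = translate(alt_code, lang, primary)
        if passes_tests(candidate):
            return candidate
    return None


# --- Toy stubs so the sketch runs without an LLM (illustrative only) ---

def toy_generate(task: str, lang: str) -> str:
    if lang == "python":
        # Deliberately wrong primary-language attempt.
        return "def add(a, b):\n    return a - b\n"
    # Stands in for a correct solution in another language.
    return "int add(int a, int b) { return a + b; }"


def toy_translate(code: str, src: str, dst: str) -> str:
    # A real system would ask the model to translate; this is hard-coded.
    return "def add(a, b):\n    return a + b\n"


def toy_passes_tests(code: str) -> bool:
    namespace: dict = {}
    exec(code, namespace)  # run the candidate, then check one test case
    return namespace["add"](2, 3) == 5


result = mple(
    "add two integers", "python", ["java", "cpp"],
    toy_generate, toy_translate, toy_passes_tests,
)
print(result)
```

In this toy run the direct Python attempt fails its test, so the loop falls back to the first alternate language and returns the translated version. With no alternate languages available, the function returns `None` rather than an untested candidate.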
