Assessing Code Generation with Intermediate Languages

Xun Deng, Sicheng Zhong, Honghua Dong, Jingyu Hu, Sidi Mohamed Beillahi, Xujie Si, Fan Long

TL;DR

This paper explores whether intermediate-step prompts can boost LLM code generation and motivates a two-stage prompting framework built on intermediate representations. It evaluates natural language, pseudo-code, and code in five programming languages as intermediate representations across 11 models on the multilingual HumanEval-X benchmark, using a three-part experimental design. Natural language consistently yields the strongest gains, and intermediate representations help most in larger models that have not yet reached state-of-the-art performance; no single intermediate programming language is universally effective. The correlation between intermediate correctness and final-code correctness is weak, suggesting a chain-of-thought effect, while repeated prompting notably benefits GPT-family models. The findings inform prompting strategy design for cross-language code generation and highlight the role of model size and reasoning cues in leveraging intermediate representations.

Abstract

Intermediate-step methodologies such as chain of thought (CoT) have proven effective at improving the performance of Large Language Models (LLMs) on code generation. This study explores the use of intermediate languages, including various programming languages, natural language solutions, and pseudo-code, and systematically evaluates their impact on the performance of LLMs in code generation tasks. Our experiments encompass eleven models across the CodeLlama, GPT, and Mistral families, as well as newly released smaller models. Our findings reveal that intermediate languages are generally more effective in larger models that have not yet achieved state-of-the-art performance. Natural language consistently emerges as the most effective intermediate representation across all target languages. However, we observe no universally effective intermediate formal language across different models and target languages. Furthermore, we uncover a weak correlation between the correctness of intermediate solutions and that of the final generation, suggesting that improvements may stem from the chain-of-thought effect rather than language-specific transfer. Interestingly, we discover that for GPT-family models, prompting multiple times without explicit self-correction instructions yields performance gains across the studied languages.

Paper Structure (19 sections, 1 figure, 11 tables)

Figures (1)

  • Figure 1: Comparison of the standard prompting flow and the intermediate-target prompting flow. In intermediate-target prompting, we first prompt the model, given only the task description, to generate a solution in an intermediate language. We then prompt the model to generate a solution in the target language.
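
The two flows described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical stand-in for any function that sends a prompt to a model and returns its text output, the prompt wording is invented for the example, and passing the intermediate solution back to the model in the second stage is an assumption that the caption does not spell out.

```python
from typing import Callable, Tuple

# Hypothetical type: any function mapping a prompt string to model output text.
Generate = Callable[[str], str]


def standard_prompt(generate: Generate, task: str, target: str) -> str:
    """Baseline flow: a single call that asks directly for target-language code."""
    return generate(f"{task}\n\nWrite a solution in {target}.")


def intermediate_target_prompt(
    generate: Generate, task: str, intermediate: str, target: str
) -> Tuple[str, str]:
    """Two-stage flow: an intermediate solution first, then the target solution."""
    # Stage 1: the prompt contains the task description only.
    draft = generate(f"{task}\n\nWrite a solution in {intermediate}.")

    # Stage 2 (assumption): the stage-1 output is passed back as context
    # alongside the task before asking for the target language.
    final = generate(
        f"{task}\n\nHere is a solution in {intermediate}:\n{draft}\n\n"
        f"Write a solution in {target}."
    )
    return draft, final
```

Returning both the intermediate and the final output keeps the two available for separate correctness checks, which is the kind of comparison the abstract's correlation finding relies on.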