Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval

Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, Shing-Chi Cheung

TL;DR

This work introduces PseudoEval, a multilingual benchmark that uses pseudocode as input to decouple problem-solving from language-coding in LLM code generation. By evaluating across Python, C++, and Rust with an automated pseudocode extraction pipeline, the study shows that problem-solving is the dominant bottleneck in many cases, while language-coding varies by language and improves when guided by pseudocode. The results demonstrate cross-language transfer of problem-solving capabilities, the effectiveness of one-shot prompts and moderate attempt counts, and that automatically generated pseudocode can rival human-authored versions for guiding code generation. The authors also provide a pipeline to extend existing benchmarks and discuss limitations and opportunities for future work to broaden applicability beyond algorithmic tasks and to more languages.

Abstract

Existing code generation benchmarks for Large Language Models (LLMs), such as HumanEval and MBPP, are designed to study LLMs' end-to-end performance: the benchmarks feed a problem description in natural language as input and examine the generated code in specific programming languages. However, the evaluation scores obtained in this way provide little hint as to the bottleneck of code generation, namely whether LLMs are struggling with their problem-solving capability or their language-coding capability. To answer this question, we construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input. By doing so, the bottleneck of code generation in various programming languages can be isolated and identified. Our study yields several interesting findings. For example, we identify that the bottleneck of LLMs in Python programming is problem-solving, while in Rust LLMs struggle relatively more with language-coding. Also, our study indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort, especially for undertrained programming languages. Finally, we release the pipeline for constructing PseudoEval to facilitate its extension to existing benchmarks. PseudoEval is available at: https://anonymous.4open.science/r/PseudocodeACL25-7B74.

Paper Structure

This paper contains 26 sections, 21 figures, 5 tables.

Figures (21)

  • Figure 1: Motivating example
  • Figure 2: Workflow of constructing the PseudoEval dataset and empirical study
  • Figure 3: Zero-/one-shot Pass@{1,5,10} rates of C++ programs generated from pseudocode of C++ solutions
  • Figure 4: Pass@k of code generation from pseudocode from C++ to all languages, compared with direct generation from problems
  • Figure 5: Pass@k of code generation from pseudocode from Python to all languages, compared with direct generation from problems
  • ...and 16 more figures
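
The figures listed above report Pass@k rates (for example Pass@{1,5,10}). The text here does not say how Pass@k is computed, so the following is only a minimal sketch that assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n generations are sampled per problem and c of them pass all tests. The function name `pass_at_k` and its parameters are illustrative and are not taken from the PseudoEval pipeline.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed estimator, not from the paper).

    n: number of generated programs for the problem
    c: number of those programs that pass all tests
    k: attempt budget being evaluated (1 <= k <= n)

    Returns the probability that at least one of k programs, drawn without
    replacement from the n generations, is correct.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failing programs;
    # it is 0 when fewer than k programs fail, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score under this assumption is the mean of the per-problem estimates, for instance `mean_pass_at_k([(10, 5), (10, 0)], k=1)` for two problems with ten generations each.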