Can Emulating Semantic Translation Help LLMs with Code Translation? A Study Based on Pseudocode
Songqiang Chen, Congying Xu, Jingyi Chen, Jialun Cao, Jiarong Wu, Shing-Chi Cheung
TL;DR
This study investigates whether emulating semantic translation through pseudocode can improve LLM-based code translation compared with direct single-step translation. It conducts an extensive empirical evaluation across six languages, 323 LeetCode problems, and five LLMs, testing five prompting strategies and reporting execution-based pass@10 results. The findings show that hybrid strategies that combine direct translation with pseudocode guidance provide systematic accuracy gains, particularly for flexible-to-rigid language pairs and low-resource Rust, with larger benefits on harder tasks. High-quality pseudocode substantially boosts performance, highlighting pseudocode quality as a key bottleneck, while case studies reveal both the advantages and limitations of this approach. The work recommends hybrid, strategy-aware use of pseudocode to enhance code translation accuracy and points to future directions in pseudocode generation, validation, and automatic strategy selection.
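The summary above reports execution-based pass@10 but does not say how it is computed. As a hedged illustration, the sketch below implements the widely used unbiased pass@k estimator. It is an assumption that the study uses this form; the function name `pass_at_k` and its parameters are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of translations sampled for the task
    c: number of those translations that pass every test case
    k: number of attempts the metric allows (k = 10 for pass@10)

    Assumption: this is the common unbiased estimator,
    1 - C(n - c, k) / C(n, k); the study may compute pass@10 differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

When exactly k samples are drawn (n = k = 10), the estimate reduces to a simple check: the task counts as solved if at least one of the ten translations passes all tests.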
Abstract
Large language models (LLMs) show great potential in code translation. However, accurate translation remains challenging when using the commonly adopted direct code-to-code translation approach, which converts a program into the target programming language (PL) in a single step. Inspired by the success of incorporating intermediate steps to guide LLMs in solving challenging tasks, we explore pseudocode-based code translation, which emulates human semantic translation by first expressing the program's intent and logic as pseudocode and then implementing that pseudocode in the target PL. We find that pseudocode-based translation helps translate programs that direct translation struggles to handle. Nonetheless, the effectiveness, advantages, and limitations of this approach remain underexplored. To bridge this gap, we present an empirical study on pseudocode-based code translation, aiming to investigate its effectiveness in enhancing the direct translation approach, illuminate its effective usage, and identify the limitations that hinder its potential benefits. By comparing direct and pseudocode-based translation approaches on 9,690 translation tasks across six PLs with five popular LLMs, we demonstrate that pseudocode-based translation can effectively complement direct translation, particularly when translating from flexible to rigid PLs or dealing with low-resource Rust. Based on these findings, we suggest adopting strategies that combine the complementary strengths of both approaches to enhance code translation accuracy. We also reveal the advantages of pseudocode-based translation in disentangling translations of complicated programs and mitigating distractions from detailed implementations in original programs, as well as its limitations due to incorrect, incomplete, or ambiguous pseudocode.
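To make the contrast between the two approaches concrete, the following is a minimal sketch, not the study's implementation. It assumes a hypothetical `llm` callable that maps a prompt string to a completion string. The prompt wording, the function names, and the even split of the sampling budget in `hybrid_candidates` are all illustrative assumptions.

```python
from typing import Callable, List

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


def direct_translate(llm: LLM, source: str, src_lang: str, tgt_lang: str) -> str:
    """Single-step, code-to-code translation."""
    prompt = (
        f"Translate the following {src_lang} program to {tgt_lang}.\n\n{source}"
    )
    return llm(prompt)


def pseudocode_translate(llm: LLM, source: str, src_lang: str, tgt_lang: str) -> str:
    """Two-step translation: program -> pseudocode -> target program."""
    # Step 1: capture the program's intent and logic as pseudocode.
    pseudocode = llm(
        f"Describe the intent and logic of this {src_lang} program "
        f"as pseudocode.\n\n{source}"
    )
    # Step 2: implement the pseudocode in the target language.
    return llm(f"Implement the following pseudocode in {tgt_lang}.\n\n{pseudocode}")


def hybrid_candidates(
    llm: LLM, source: str, src_lang: str, tgt_lang: str, budget: int = 10
) -> List[str]:
    """Collect candidates from both approaches under one sampling budget.

    Assumption: the budget is split evenly; this is only one possible way
    to combine the two approaches.
    """
    n_direct = budget // 2
    candidates = [
        direct_translate(llm, source, src_lang, tgt_lang) for _ in range(n_direct)
    ]
    candidates += [
        pseudocode_translate(llm, source, src_lang, tgt_lang)
        for _ in range(budget - n_direct)
    ]
    return candidates
```

In an execution-based evaluation, each candidate would then be compiled and run against the task's test cases, so a task counts as solved if either approach yields a passing translation.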
