Specification-Driven Code Translation Powered by Large Language Models: How Far Are We?
Soumit Kanti Saha, Fazle Rabbi, Song Wang, Jinqiu Yang
TL;DR
This paper investigates whether natural-language specifications (NL-specifications) can serve as an intermediate representation to improve code translation based on large language models (LLMs). It compares two translation pipelines, one using the NL-specification only and one using the NL-specification together with the source code, against a baseline on the CodeNet, Avatar, and EvalPlus datasets, using GPT-4 across five target languages and multiple language pairs. Translations are evaluated by $pass@1$ and by SonarQube-based quality metrics. The findings show that the NL-specification alone does not consistently improve translations, but combining it with the source code yields improvements for certain language pairs (notably Python and C++ to other languages) and after compilation-error repair. Overall, the paper reports improvements of about 8.5% and 6.1% in corrected translations for the two approaches, respectively. The quality analysis shows that NL-specification-assisted translations can reduce code-quality warnings in some cases, though translations into C remain problematic because of complex language semantics and safety concerns. The work highlights both the potential and the limitations of NL-specifications as an intermediate representation for cross-language code translation, and it suggests future directions for improving NL-specification generation and target-language reliability.
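As a reminder of what the $pass@1$ metric measures, the sketch below uses the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ pass all tests. This summary does not state how the paper computes the metric, so the single-sample setting shown in `mean_pass_at_1` is an assumption, and both function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated translations, c: how many pass all tests,
    k: sampling budget. For k = 1 this reduces to c / n.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(passed: list[bool]) -> float:
    """Assumed setting: one translation per problem, so pass@1 is the
    fraction of problems whose translation passes every test."""
    if not passed:
        raise ValueError("need at least one problem")
    return sum(pass_at_k(1, int(p), 1) for p in passed) / len(passed)
```

With one translation per problem, `mean_pass_at_1` is simply the share of problems whose translated program passes all of its tests.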
Abstract
Large Language Models (LLMs) are increasingly being applied across various domains, including code-related tasks such as code translation. Previous studies have explored using LLMs for translating code between different programming languages. Since LLMs are more effective with natural language, using natural language as an intermediate representation in code translation tasks presents a promising approach. In this work, we investigate using NL-specification as an intermediate representation for code translation. We evaluate our method using three datasets, five popular programming languages, and 29 language pair permutations. Our results show that using NL-specification alone does not lead to performance improvements. However, when combined with source code, it provides a slight improvement over the baseline in certain language pairs. Besides analyzing the performance of code translation, we also investigate the quality of the translated code and provide insights into the issues present in the translated code.
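The two pipelines described above can be pictured with a minimal sketch. Everything in it is hypothetical: the prompt wording, the function names, and the `echo_llm` stand-in are illustrative assumptions, not the prompts or tooling used in the paper. The only point is the data flow: the source code is first summarized into an NL-specification, which is then passed to the model either alone or together with the original source.

```python
from typing import Callable

# Assumption: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def generate_spec(llm: LLM, source_code: str, source_lang: str) -> str:
    """Step 1: ask the model for an NL-specification of the source program."""
    prompt = (
        f"Describe in plain English what the following {source_lang} program "
        "does, including its inputs, outputs, and edge cases.\n\n"
        f"{source_code}"
    )
    return llm(prompt)


def translate_spec_only(llm: LLM, spec: str, target_lang: str) -> str:
    """Pipeline A: the model sees only the NL-specification."""
    prompt = (
        f"Write a {target_lang} program that satisfies this specification:\n\n"
        f"{spec}"
    )
    return llm(prompt)


def translate_spec_with_source(
    llm: LLM, spec: str, source_code: str, source_lang: str, target_lang: str
) -> str:
    """Pipeline B: the model sees the NL-specification and the source code."""
    prompt = (
        f"Translate the following {source_lang} program to {target_lang}. "
        "Use the specification as a guide.\n\n"
        f"Specification:\n{spec}\n\n"
        f"{source_lang} source:\n{source_code}"
    )
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in model for illustration: returns a marker plus the prompt."""
    return "[model output for]\n" + prompt
```

Swapping `echo_llm` for a real model client leaves the pipeline functions unchanged, which makes it straightforward to compare both variants against a direct source-to-target baseline on the same inputs.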
