LLM Based Long Code Translation using Identifier Replacement
Manojit Chakraborty, Madhusudan Ghosh, Rishabh Gupta
TL;DR
The paper tackles the challenge of translating long code with LLMs constrained by context windows by introducing a zero-shot identifier-replacement technique that substitutes lengthy user-defined identifiers with compact placeholders, thereby reducing token usage and memory requirements. The approach preserves syntactic and hierarchical structure, focuses the model on core code logic, and later restores the original identifiers via a reverse mapping. The token-length reduction is formalized as $\Delta l = \sum_{j=1}^{k} (|i_j| - |p_j|)$, where $i_j$ is the $j$-th replaced identifier and $p_j$ is its placeholder. Experimental evaluation on XcodeEval across multiple languages and LLMs demonstrates token savings and competitive translation fidelity, with more pronounced benefits for procedural languages and larger models. The solution is model-agnostic, cost-efficient, and scalable for industry-scale long code translation, offering a practical path toward robust zero-shot code translation that respects functional integrity while expanding context capacity.
Abstract
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source code that does not fit into the context window, which leads to inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting long user-defined identifiers with generalized placeholders during translation, our method reduces token count and memory usage and lets the LLM focus on the logical structure of the code, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translations with fewer tokens.
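To make the replace-then-restore idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple regex-based scan rather than a real parser, measures length in characters rather than model tokens, and treats distinct identifiers (not occurrences) as the terms of the length-reduction sum. The function names, the `id_N` placeholder scheme, the `min_len` threshold, and the `reserved` keyword set are all hypothetical choices made for illustration.

```python
import re

# Matches C-like identifiers; a real pipeline would use a language-aware parser.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def replace_identifiers(source, reserved, min_len=8):
    """Swap long user-defined names for short placeholders.

    Assumption: names in `reserved` (keywords, library names) and names
    shorter than `min_len` characters are left untouched. The source is
    also assumed not to already contain names of the form id_N.
    """
    mapping = {}  # original identifier -> placeholder

    def substitute(match):
        name = match.group(0)
        if name in reserved or len(name) < min_len:
            return name
        if name not in mapping:
            mapping[name] = f"id_{len(mapping)}"
        return mapping[name]

    return IDENT.sub(substitute, source), mapping


def restore_identifiers(translated, mapping):
    """Apply the reverse mapping to put the original names back."""
    reverse = {placeholder: name for name, placeholder in mapping.items()}
    return IDENT.sub(lambda m: reverse.get(m.group(0), m.group(0)), translated)


def length_reduction(mapping):
    """Sum of (identifier length - placeholder length) over replaced names.

    Measured in characters here as a stand-in for token counts.
    """
    return sum(len(name) - len(ph) for name, ph in mapping.items())


source = (
    "int accumulated_total_value = 0;\n"
    "for (int index = 0; index < 10; index++) accumulated_total_value += index;"
)
compressed, mapping = replace_identifiers(source, reserved={"int", "for"})

# In the actual workflow the compressed code would be sent to an LLM for
# translation; here the round trip is applied directly to the same text.
restored = restore_identifiers(compressed, mapping)
```

In this sketch, `compressed` is what would be placed in the prompt, and `restore_identifiers` would be run on the model's output. A regex scan also rewrites matching names inside string literals and comments, which the round trip undoes but which a parser-based version would avoid.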
