LLM Based Long Code Translation using Identifier Replacement

Manojit Chakraborty, Madhusudan Ghosh, Rishabh Gupta

TL;DR

The paper addresses the problem of translating long source code with LLMs whose context windows are limited. It introduces a zero-shot identifier-replacement technique that substitutes lengthy user-defined identifiers with compact placeholders, reducing token usage and memory requirements. The model then works on the core logic of the code, which preserves its syntactic and hierarchical structure, and the original identifiers are restored afterwards through a reverse mapping. The efficiency gain is characterized by the length reduction $\Delta l = \sum_{j=1}^{k} (|i_j| - |p_j|)$, where $i_j$ is the $j$-th replaced identifier and $p_j$ is its placeholder. Experiments on XcodeEval across multiple languages and LLMs show token savings and competitive translation fidelity, with more pronounced benefits for procedural languages and larger models. The method is model-agnostic and cost-efficient, and is presented as a practical route to zero-shot translation of long, industry-scale code that preserves functionality while freeing context capacity.
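As a rough illustration of the length-reduction formula above, the following sketch evaluates the sum for a small, made-up mapping. It assumes that $|\cdot|$ can be approximated by character length and that the sum runs once over each distinct identifier; the paper may instead count tokens under a specific tokenizer, so the number is only indicative. The identifier names and the helper `length_reduction` are hypothetical.

```python
def length_reduction(mapping):
    """Sum of (|i_j| - |p_j|) over an identifier -> placeholder mapping.

    Assumption: |x| is the character length of x, used here as a simple
    stand-in for a tokenizer-specific token count.
    """
    return sum(len(identifier) - len(placeholder)
               for identifier, placeholder in mapping.items())


# Hypothetical identifiers and placeholders, chosen only for illustration.
mapping = {
    "total_customer_balance": "v1",
    "compute_monthly_interest": "f1",
    "interest_rate_percent": "v2",
}

delta_l = length_reduction(mapping)
print(delta_l)
```

With these three names the identifiers total 67 characters and the placeholders 6, so the sketch reports a reduction of 61 characters per set of distinct identifiers.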

Abstract

In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code in one programming language is translated into another while preserving its functionality. However, LLMs often struggle with long source code that does not fit into the context window, which leads to inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting long user-given identifiers with generalized placeholders during translation, our method reduces token count and memory usage and lets the LLM focus on the logical structure of the code, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.

Paper Structure

This paper contains 17 sections, 2 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: An overview of the identifier extraction and replacement algorithm for long source code translation. The process involves identifier extraction, classification into syntactic categories, replacement using an identifier mapping strategy, code translation, and final restoration to ensure syntactic and semantic correctness.
  • Figure 2: The figure illustrates the transformation of source code during the identifier replacement (IdRep) process. (a) represents the original source code with long, descriptive identifiers, while (b) shows the modified code where identifiers are replaced with shorter placeholders. This transformation helps reduce token count, enabling LLMs to process longer sequences efficiently while preserving semantic correctness.
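
The replacement and restoration steps described in the two captions can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: the list of user-defined identifiers is taken as given (the extraction and classification steps are not shown), the placeholder scheme `id1`, `id2`, ... is hypothetical, and whole-word regular-expression matching stands in for a real parser, so it would also rewrite matches inside string literals and comments. The LLM translation call that sits between the two steps is omitted.

```python
import re


def build_mapping(identifiers, prefix="id"):
    """Assign a short placeholder to each identifier (hypothetical scheme)."""
    return {name: f"{prefix}{n}" for n, name in enumerate(identifiers, start=1)}


def substitute(code, table):
    """Replace every whole-word occurrence of a key in `table` by its value."""
    if not table:
        return code
    # Longest keys first, so a longer name is preferred over a shorter one
    # that happens to be its prefix.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: table[match.group(1)], code)


def replace_identifiers(code, mapping):
    """Forward step: long identifiers -> placeholders."""
    return substitute(code, mapping)


def restore_identifiers(code, mapping):
    """Reverse step: placeholders -> original identifiers."""
    reverse = {placeholder: name for name, placeholder in mapping.items()}
    return substitute(code, reverse)


source = (
    "int accumulated_total = 0;\n"
    "for (int loop_index = 0; loop_index < 10; loop_index++) {\n"
    "    accumulated_total += loop_index;\n"
    "}\n"
)

mapping = build_mapping(["accumulated_total", "loop_index"])
shortened = replace_identifiers(source, mapping)
# A translation step would operate on `shortened` here; it is left out.
restored = restore_identifiers(shortened, mapping)

print(shortened)
print(restored == source)
```

In practice the placeholders would need to be checked against names already present in the code, and restoration would run on the translated output rather than on the shortened source as it does in this round-trip demonstration.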