TIT: A Tree-Structured Instruction Tuning Approach for LLM-Based Code Translation

He Jiang, Yufu Wang, Hao Lin, Peiyu Zou, Zhide Zhou, Ang Jia, Xiaochen Li, Zhilei Ren

TL;DR

This work tackles two central challenges in LLM-driven code translation: syntactic confusion, where source-language features leak into the output, and semantic misalignment at a fine-grained level. It introduces TIT, a framework of three modules: a syntactic information representation module, a fine-grained parallel dataset augmentation module, and a dual-stage tree instruction tuning module. Together they ground syntax and semantics in a language-agnostic representation and refine translation through staged learning. Empirical results on Python→Java translation across multiple LLMs show higher translation success rates and markedly less syntactic confusion, surpassing several specialized code-translation baselines and approaching larger general-purpose models. The findings indicate that structured syntactic information improves cross-language code generation, and they point to TIT's potential for broader multilingual code translation and real-world software migration.

Abstract

Large Language Models (LLMs) have shown strong performance in automated source-to-target code translation through pretraining on extensive code corpora. However, mainstream LLM-based code translation methods suffer from two critical limitations. First, they are highly sensitive to language-specific features, which often introduce source-language syntax or lexicon into the output, leading to syntactic confusion. Second, they lack fine-grained semantic alignment due to an over-reliance on function-level parallel datasets, resulting in semantic misalignment between the translated code and the original source. To overcome these limitations, we propose TIT, a Tree-structured Instruction Tuning paradigm for LLM-based code translation. Specifically, TIT consists of three modules. First, to mitigate syntactic confusion, the syntactic information representation module integrates language-agnostic syntactic features via structured parsing. Then, to generate high-quality fine-grained parallel data, the fine-grained parallel dataset augmentation module aligns nodes with code segments through statement-level segmentation and contrastive matching. Finally, we leverage the dual-stage tree instruction tuning module to alleviate the contextual processing burden on the LLM caused by the introduction of syntactic information. The first stage employs syntax-aware fine-tuning to enable the LLM to autonomously comprehend structured syntactic information, while the second stage utilizes code generation fine-tuning to guide the model in generating accurate target code based on function-level syntactic dependencies. The experimental results demonstrate that the proposed method significantly outperforms existing approaches across multiple LLMs, achieving a code translation success rate 1.22x to 1.75x higher while markedly reducing syntactic confusion.
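
To make the ideas of structured parsing and statement-level segmentation more concrete, the following is a minimal sketch, not the authors' implementation. It assumes Python's standard `ast` module as the parser and uses a plain outline of node type names as a stand-in for a language-agnostic syntactic representation. The names `SOURCE`, `node_outline`, and `statement_segments` are hypothetical and do not come from the paper.

```python
import ast

# Hypothetical source function used only for illustration.
SOURCE = '''def add_positive(nums):
    total = 0
    for n in nums:
        if n > 0:
            total += n
    return total
'''


def node_outline(node, depth=0):
    """Return indented lines naming each AST node type under `node`.

    Only node types are kept (no identifiers or literals), which is an
    assumed simplification of a language-agnostic syntax view.
    """
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines.extend(node_outline(child, depth + 1))
    return lines


def statement_segments(source):
    """Pair each top-level statement of the first function with its source text."""
    func = ast.parse(source).body[0]
    return [
        (type(stmt).__name__, ast.get_source_segment(source, stmt))
        for stmt in func.body
    ]


segments = statement_segments(SOURCE)
outline = node_outline(ast.parse(SOURCE).body[0])

if __name__ == "__main__":
    for kind, text in segments:
        print(f"[{kind}]")
        print(text)
    print("\n".join(outline))
```

Each `(node type, code segment)` pair is the kind of statement-level unit that a fine-grained parallel dataset could align with a counterpart in the target language. The contrastive matching step described in the abstract is not sketched here.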

Paper Structure

This paper contains 28 sections, 7 equations, 14 figures, and 5 tables.

Figures (14)

  • Figure 1: A motivating example demonstrating TIT's potential to mitigate syntactic confusion. (a) Evaluation on StarCoder2-7B without TIT. (b) Evaluation on StarCoder2-7B with TIT.
  • Figure 2: Syntactic confusion across methods on the HumanEval-X Dataset.
  • Figure 3: An example of line-efficiency non-equivalence in expression density within function-level parallel datasets.
  • Figure 4: Workflow of TIT.
  • Figure 5: An example of processing the AST node.
  • ...and 9 more figures