Post-Incorporating Code Structural Knowledge into Pretrained Models via ICL for Code Translation

Yali Du, Hui Sun, Ming Li

TL;DR

This work tackles code translation by addressing the gap that code syntactic structure is underutilized by pretrained LLMs. It introduces CAST, a surrogate information-coverage score based on Abstract Syntax Tree (AST) subtrees, and shows CAST maximization is NP-hard but admits a greedy (1-1/e)-approximation via submodular optimization. The method is training-free and model-agnostic, enabling post-incorporation of code structure during inference by selecting a small exemplar set (CAST-F) or a dynamic set (CAST-A) and appending them to the test input. Empirical results across multiple LLMs, datasets, and tasks demonstrate substantial performance gains in code translation and even code summarization, while also revealing that simply scaling model size or data does not induce sufficient code-structure knowledge without explicit consideration of syntax. The work highlights CAST as a general, robust framework for syntax-aware knowledge transfer in code-related NLP tasks with practical implications for tooling and software maintenance.

Abstract

Code translation migrates codebases across programming languages. Recently, large language models (LLMs) have achieved significant advancements in software mining. However, handling the syntactic structure of source code remains a challenge. Classic syntax-aware methods depend on intricate model architectures and loss functions, rendering their integration into LLM training resource-intensive. This paper employs in-context learning (ICL), which directly integrates task exemplars into the input context, to post-incorporate code structural knowledge into pre-trained LLMs. We revisit exemplar selection in ICL from an information-theoretic perspective, proposing that list-wise selection based on information coverage is a more precise and more general objective than traditional methods that combine similarity and diversity. To address the challenge of quantifying information coverage, we introduce a surrogate measure, Coverage of Abstract Syntax Tree (CAST). Furthermore, we formulate the NP-hard CAST maximization problem for exemplar selection and prove that it is a standard submodular maximization problem. We therefore propose a greedy algorithm for CAST submodular maximization, which theoretically guarantees a (1-1/e)-approximate solution in polynomial time. Our method is the first training-free and model-agnostic approach to post-incorporate code structural knowledge into existing LLMs at test time. Experimental results show that our method significantly improves LLM performance and reveal two meaningful insights: 1) Code structural knowledge can be effectively post-incorporated into pre-trained LLMs during inference, despite being overlooked during training; 2) Scaling up model size or training data does not lead to the emergence of code structural knowledge, underscoring the necessity of explicitly considering code syntactic structure.
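The greedy selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that coverage is the fraction of the query's AST subtree signatures that appear in at least one selected exemplar, it uses Python's standard `ast` module as a stand-in parser, and it treats a "subtree" as a node type together with its direct child types. The function names `subtree_signatures`, `coverage`, and `greedy_select` are hypothetical.

```python
from __future__ import annotations

import ast


def subtree_signatures(source: str) -> set[str]:
    """Depth-1 AST subtree signatures: node type plus its child node types.

    Assumption: this is a simplified proxy for the AST subtrees used by CAST.
    """
    sigs = set()
    for node in ast.walk(ast.parse(source)):
        children = [type(c).__name__ for c in ast.iter_child_nodes(node)]
        sigs.add(f"{type(node).__name__}({','.join(children)})")
    return sigs


def coverage(query: str, exemplars: list[str]) -> float:
    """Fraction of the query's signatures covered by the exemplar set."""
    query_sigs = subtree_signatures(query)
    if not query_sigs:
        return 0.0
    covered = set().union(*(subtree_signatures(e) for e in exemplars))
    return len(covered & query_sigs) / len(query_sigs)


def greedy_select(query: str, pool: list[str], k: int) -> list[int]:
    """Greedily pick up to k pool indices, maximizing marginal coverage gain."""
    query_sigs = subtree_signatures(query)
    # Only signatures that also occur in the query can add coverage.
    pool_sigs = [subtree_signatures(p) & query_sigs for p in pool]
    covered: set[str] = set()
    chosen: list[int] = []
    for _ in range(min(k, len(pool))):
        best, best_gain = None, 0
        for i, sigs in enumerate(pool_sigs):
            if i in chosen:
                continue
            gain = len(sigs - covered)  # marginal gain of adding exemplar i
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # no remaining candidate adds coverage
        chosen.append(best)
        covered |= pool_sigs[best]
    return chosen


if __name__ == "__main__":
    query = "for i in range(3):\n    if i:\n        print(i)\n"
    pool = [
        "x = 1\n",
        "for j in range(5):\n    print(j)\n",
        "if a:\n    print(a)\n",
    ]
    picked = greedy_select(query, pool, k=2)
    print(picked, coverage(query, [pool[i] for i in picked]))
```

Because each step only needs the marginal gain `len(sigs - covered)`, the loop runs in time polynomial in the pool size and `k`, which is the property that makes the greedy strategy practical for a coverage objective of this kind.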

Paper Structure

This paper contains 21 sections, 3 theorems, 32 equations, 8 figures, 6 tables, 2 algorithms.

Key Result

Lemma 2

Let $A, B \in \{0,1\}^m$. Then, the following holds: where $\| \cdot \|_1$ denotes the $\ell_1$-norm.

Figures (8)

  • Figure 1: Revisiting ICL exemplar selection from an information-theoretic perspective; CAST for code translation.
  • Figure 2: Overview of post-incorporating code syntactic knowledge into pre-trained models using ICL based on CAST retrieval.
  • Figure 3: The performance of Qwen2-7B with various strategies and shot count.
  • Figure 4: Comparison of the average coverage of abstract syntax tree (CAST) between the CAST-F and Levenshtein distance (LD) strategies.
  • Figure 5: Sensitivity analysis of hyper-parameters in the average CA of Qwen2-7B for the selection size $k$ and the pre-recalled candidate size $\lfloor t \cdot k \rfloor$.
  • ...and 3 more figures

Theorems & Definitions (6)

  • Definition 1: Submodularity (Nemhauser et al., 1978)
  • Lemma 2: Properties of $\ell_1$-norm in boolean vector operations
  • proof
  • Proposition 3: The value function for $\mathrm{CAST}$ maximization is non-negative, monotone, and submodular.
  • proof
  • Theorem 4: $(1-1/e)$-approximation of greedy submodular maximization (Nemhauser et al., 1978)
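
For reference, the two cited notions can be written out in their standard textbook form. This is a sketch under the usual conventions, not a transcription of the paper's statements: the ground set $V$, the set function $f$, the budget $k$, and the greedy output $S_k$ are notation assumed here, and the paper's own symbols may differ.

```latex
% Submodularity (diminishing returns): for a set function f : 2^V \to \mathbb{R},
\[
  f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B)
  \qquad \text{for all } A \subseteq B \subseteq V,\; x \in V \setminus B .
\]

% Greedy guarantee: if f is non-negative, monotone, and submodular, and S_k is
% the set built by k greedy steps (each adding the element of largest marginal gain), then
\[
  f(S_k) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{S \subseteq V,\; |S| \le k} f(S) .
\]
```

Proposition 3 supplies exactly the three properties (non-negativity, monotonicity, submodularity) that the second inequality requires of the value function.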