Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation
Manish Bhattarai, Minh Vu, Javier E. Santos, Ismael Boureima, Daniel O'Malley
TL;DR
The paper tackles cross-language code translation from Fortran to C++ by introducing task-specific embedding alignment within a Retrieval-Augmented Generation (RAG) framework. It trains a Fortran embedding model with a Soft Information Noise-Contrastive Estimation (S-InfoNCE) loss guided by CodeBLEU similarities between generated translations, aligning retrieval with downstream translation quality without fine-tuning the LLM. Empirical results on HPC Fortran2CPP and Numerical Recipes show substantial CodeBLEU gains (0.64→0.73 and 0.52→0.60, respectively), demonstrating the approach's effectiveness and practical efficiency across multiple model sizes. The method uses a dataset of 25k Fortran snippets from Stack-V2 with translations produced by LLaMA 3.1-8B, and it suggests a way to boost code translation performance in resource-constrained settings that could extend to other language pairs and task-specific metrics.
Abstract
We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment into a Retrieval-Augmented Generation (RAG) framework. Unlike conventional retrieval approaches that use generic embeddings agnostic to the downstream task, our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric. This alignment ensures that the embeddings are semantically and syntactically meaningful for the specific code translation task. Our methodology involves constructing a dataset of 25,000 Fortran code snippets sourced from the Stack-V2 dataset and generating their corresponding C++ translations using the LLaMA 3.1-8B language model. We compute pairwise CodeBLEU scores between the generated translations and ground truth examples to capture fine-grained similarities. These scores serve as supervision signals in a contrastive learning framework, where we optimize the embedding model to retrieve the Fortran-C++ pairs that are most beneficial for improving the language model's translation performance. By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality over methods employing generic embeddings. On the HPC Fortran2C++ dataset, our method raises the average CodeBLEU score from 0.64 to 0.73, a 14% relative improvement. On the Numerical Recipes dataset, we observe an increase from 0.52 to 0.60, a 15% relative improvement. Importantly, these gains are realized without any fine-tuning of the language model, underscoring the efficiency and practicality of our approach.
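To make the supervision signal concrete, the following is a minimal sketch of a soft-label contrastive loss in the spirit of the S-InfoNCE objective mentioned above. It is not the authors' formulation, which this abstract does not spell out: the function name `soft_info_nce`, the temperature value, the use of cosine similarity, and the choice to normalize each row of CodeBLEU scores into a target distribution are all assumptions made for illustration.

```python
import numpy as np

def soft_info_nce(embeddings, similarity, temperature=0.1):
    """Illustrative soft-label InfoNCE loss (hypothetical, not the paper's code).

    embeddings: (n, d) array of Fortran snippet embeddings.
    similarity: (n, n) array of pairwise CodeBLEU scores in [0, 1].
    """
    # Cosine similarity between all pairs of L2-normalized embeddings.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # Exclude self-pairs from both the softmax and the targets.
    n = logits.shape[0]
    mask = ~np.eye(n, dtype=bool)
    logits = np.where(mask, logits, -np.inf)

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Assumption: each row of CodeBLEU scores is normalized into a
    # soft target distribution over the other snippets in the batch.
    targets = np.where(mask, similarity, 0.0)
    targets = targets / (targets.sum(axis=1, keepdims=True) + 1e-12)

    # Cross-entropy between soft targets and predicted distribution.
    per_row = -(targets * np.where(mask, log_p, 0.0)).sum(axis=1)
    return float(per_row.mean())

# Toy batch: snippets 0-1 and 2-3 have similar translations.
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]])
aligned = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
loss_aligned = soft_info_nce(aligned, sim)
loss_misaligned = soft_info_nce(misaligned, sim)
```

In this toy batch the loss is lower when the embedding geometry matches the CodeBLEU similarity structure (`loss_aligned < loss_misaligned`), which is the behavior a retrieval model trained this way would be pushed toward. A real training loop would use a differentiable framework and a learned encoder rather than fixed NumPy arrays.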
