Enhancing Code Translation in Language Models with Few-Shot Learning via Retrieval-Augmented Generation

Manish Bhattarai; Javier E. Santos; Shawn Jones; Ayan Biswas; Boian Alexandrov; Daniel O'Malley

TL;DR

Problem: automated code translation by LLMs struggles with context and complex constructs. Approach: a Retrieval-Augmented Generation framework that retrieves relevant translation examples from a code corpus to guide few-shot translations, using multiple embeddings and vector stores. Contributions: a detailed evaluation across open and commercial LLMs on Fortran-to-C++ datasets showing consistent CodeBLEU gains with RAG, analysis of shot numbers and embeddings, and dataset-specific insights. Significance: demonstrates dynamic, scalable adaptation to diverse translation tasks without extensive fine-tuning, informing future code-translation systems.

Abstract

Large language models (LLMs) have significantly advanced code translation, enabling automated translation between programming languages. However, these models often struggle with complex translation tasks because they lack sufficient contextual understanding. This paper introduces an approach that enhances code translation through few-shot learning augmented with retrieval-based techniques. By leveraging a repository of existing code translations, we dynamically retrieve the most relevant examples to guide the model in translating new code segments. Our method, based on Retrieval-Augmented Generation (RAG), substantially improves translation quality by providing contextual examples that the model can draw on at inference time. We selected RAG over traditional fine-tuning because it can use existing codebases or a locally stored corpus of code, which allows dynamic adaptation to diverse translation tasks without extensive retraining. Extensive experiments on diverse datasets with open LLMs such as Starcoder, Llama3-70B Instruct, CodeLlama-34B Instruct, Granite-34B Code Instruct, and Mixtral-8x22B, as well as commercial LLMs such as GPT-3.5 Turbo and GPT-4o, demonstrate our approach's superiority over traditional zero-shot methods, especially in translating from Fortran to C++. We also varied the number of shots, i.e., the examples provided during inference (1, 2, and 3), and the embedding model used for RAG (Nomic-Embed, Starencoder, and CodeBERT) to assess the robustness and effectiveness of our approach.
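The retrieval step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the bag-of-tokens `embed` function stands in for a learned embedding model such as Nomic-Embed, the in-memory list stands in for a vector store, and the prompt wording and the names `retrieve`, `build_prompt`, and `CORPUS` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical corpus of (Fortran, C++) translation pairs.
CORPUS = [
    ("do i = 1, n\n  s = s + a(i)\nend do",
     "for (int i = 0; i < n; ++i) s += a[i];"),
    ("if (x > 0) then\n  y = sqrt(x)\nend if",
     "if (x > 0) y = std::sqrt(x);"),
    ("print *, 'hello'",
     'std::cout << "hello" << std::endl;'),
]


def embed(code: str) -> Counter:
    """Toy embedding: token counts (assumption; stands in for a learned model)."""
    return Counter(re.findall(r"[a-z_]+", code.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus, k: int = 1):
    """Return the k pairs whose Fortran side is most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, corpus, k: int = 1) -> str:
    """Assemble a k-shot prompt from the retrieved pairs (wording is hypothetical)."""
    parts = ["Translate the following Fortran code to C++."]
    for fortran, cpp in retrieve(query, corpus, k):
        parts.append(f"Example Fortran:\n{fortran}\nExample C++:\n{cpp}")
    parts.append(f"Fortran:\n{query}\nC++:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    new_code = "do j = 1, m\n  total = total + b(j)\nend do"
    print(build_prompt(new_code, CORPUS, k=1))
```

Setting `k` to 1, 2, or 3 corresponds to the shot counts explored in the paper; with `k=0` the prompt reduces to a zero-shot request.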

Paper Structure (13 sections, 3 equations, 7 figures, 2 tables)

Introduction
Related works
Methods
Numerical Recipes Dataset:
HPC Fortran2CPP Dataset:
Stack-V2 Dataset:
Results and Discussions
Performance Across Models and Embeddings
Impact of Few-Shot Learning
Dataset-Specific Performance: HPC Fortran2CPP vs. Numerical Recipes in Few-Shot setting
Impact of LLM models
Results on Unlabelled Dataset in zero-shot settings
Conclusion and Future Work

Figures (7)

Figure 1: Pipeline for creating a few shot prompt through RAG for code translation.
Figure 2: Similarity of dataset embeddings for (a) the HPC Fortran2CPP dataset and (b) the Numerical Recipes dataset, based on the Nomic Embed model
Figure 3: Zero-Shot Translation Prompt Template
Figure 4: Few Shot Translation Prompt Template
Figure 5: Performance Comparison of One-shot vs. Zero-shot in the RAG Pipeline Using the Nomic-embed Embedding Model across Various Models and Datasets: (a) Granite-34B Code Instruct on the Numerical Recipes Dataset, (b) Granite-34B Code Instruct on the HPC Fortran2CPP Dataset, and (c) Granite-34B Code Instruct on the HPC Fortran2CPP Dataset with a bad RAG setup (using the largest distance metric for retrieval). The color of each data point represents the similarity of the retrieved one-shot example pair to the query Fortran code, with the legend indicating the intensity range of the similarity metric. Generally, a higher similarity score correlates with a higher CodeBLEU score.
...and 2 more figures
