Towards Translating Real-World Code with LLMs: A Study of Translating to Rust

Hasan Ferit Eniser, Hanliang Zhang, Cristina David, Meng Wang, Maria Christakis, Brandon Paulsen, Joey Dodds, Daniel Kroening

TL;DR

This paper tackles the problem of translating real-world code to Rust using large language models, addressing the lack of scalable, semantically validated translations. It introduces FLOURINE, an end-to-end framework that checks I/O equivalence between source programs and their Rust translations via a cross-language differential fuzzer, removing reliance on existing unit tests. The study evaluates five state-of-the-art LLMs across 408 code samples from seven real-world projects and four feedback strategies, revealing that up to 47% of benchmarks can be translated correctly, with repairs from scratch often outperforming counterexample-driven repair. The work demonstrates practical potential for reducing manual porting effort in real-world settings (e.g., AWS), while outlining paths to handle larger benchmarks and improve feedback effectiveness.

Abstract

Large language models (LLMs) show promise in code translation - the task of translating code written in one programming language to another language - due to their ability to write code in most programming languages. However, LLMs' effectiveness on translating real-world code remains largely unstudied. In this work, we perform the first substantial study on LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs: GPT4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check if a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and the translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks, and also provide insights into next steps for improvements.
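
The I/O-equivalence check described in the abstract can be illustrated with a minimal sketch. It is not FLOURINE's implementation: the paper's fuzzer compares programs across languages, whereas here both the "original" and the "translation" are hypothetical Rust functions (`reference_add`, `translated_add`), and the input generator is a simple xorshift PRNG chosen only to keep the example free of external crates. The idea it shows is the same: run both versions on generated inputs and report the first input on which their outputs differ, which could then be fed back to the LLM as a counterexample.

```rust
// Illustrative sketch only: all names here are assumptions, not FLOURINE's API.

/// Tiny deterministic xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Stand-in for the original program's behaviour: addition that saturates at 255.
fn reference_add(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}

/// Stand-in for a buggy translation: wraps around on overflow instead of saturating.
fn translated_add(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

/// Runs both functions on `trials` pseudo-random inputs and returns the first
/// input pair on which their outputs differ, or `None` if no difference was seen.
fn find_counterexample<F, G>(reference: F, candidate: G, seed: u64, trials: usize) -> Option<(u8, u8)>
where
    F: Fn(u8, u8) -> u8,
    G: Fn(u8, u8) -> u8,
{
    let mut rng = XorShift64(seed); // seed must be non-zero for xorshift
    for _ in 0..trials {
        let bits = rng.next_u64();
        let (a, b) = (bits as u8, (bits >> 8) as u8);
        if reference(a, b) != candidate(a, b) {
            return Some((a, b));
        }
    }
    None
}

fn main() {
    match find_counterexample(reference_add, translated_add, 0x9E37_79B9_7F4A_7C15, 10_000) {
        Some((a, b)) => println!("counterexample: a={a}, b={b}"),
        None => println!("no divergence found"),
    }
}
```

Finding no counterexample does not prove equivalence; it only means the sampled inputs did not expose a difference.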

Paper Structure

This paper contains 30 sections, 1 equation, 11 figures, 1 table, and 1 algorithm.

Figures (11)

  • Figure 1: Code sample from ACH
  • Figure 2: Function add from go-gt
  • Figure 3: Function from go-edlib
  • Figure 4: Rust translation of function from go-gt
  • Figure 5: LLM Prompt for obtaining translations.
  • ...and 6 more figures