Feedback Loops and Code Perturbations in LLM-based Software Engineering: A Case Study on a C-to-Rust Translation System

Martin Weiss, Jesko Hecking-Harbusch, Jochen Quante, Matthias Woehrle

TL;DR

The paper examines how feedback loops, LLM selection, and input perturbations affect an automated C-to-Rust translation system that uses a generate-and-check approach with compilability and behavioral equivalence checks. By evaluating across multiple LLMs and a diverse perturbation set on a mixed benchmark, it demonstrates that internal feedback loops substantially improve translation success and reduce model discrepancies, while code perturbations primarily offer robustness and diversity benefits. The work quantifies the cost-benefit trade-offs of iterative prompting and shows that perturbations can enhance performance when combined with feedback loops. These findings inform practical design choices for industrial, correctness-guarded LLM-based software engineering tools that translate or transform code safely.

Abstract

The advent of strong generative AI has a considerable impact on various software engineering tasks such as code repair, test generation, or language translation. While tools like GitHub Copilot are already in widespread use in interactive settings, automated approaches require a higher level of reliability before being usable in industrial practice. In this paper, we focus on three aspects that directly influence the quality of the results: a) the effect of automated feedback loops, b) the choice of Large Language Model (LLM), and c) the influence of behavior-preserving code changes. We study the effect of these three variables on an automated C-to-Rust translation system. Code translation from C to Rust is an attractive use case in industry due to Rust's safety guarantees. The translation system is based on a generate-and-check pattern, in which Rust code generated by the LLM is automatically checked for compilability and behavioral equivalence with the original C code. When a check fails, the LLM is re-prompted in a feedback loop to repair its output. These checks also allow us to evaluate and compare the success rates of the translation system when varying the three variables. Our results show that, without feedback loops, LLM selection has a large effect on translation success. However, when the translation system uses feedback loops, the differences across models diminish. We observe this for the average performance of the system as well as for its robustness under code perturbations. Finally, we find that the diversity provided by code perturbations can even improve system performance.
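The generate-and-check pattern with a feedback loop can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `translate_with_feedback`, `CheckResult`, `fake_generate`, and `fake_compiles`, as well as the prompt wording, are hypothetical, and the stubs stand in for a real LLM call and a real compiler or equivalence checker.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    ok: bool
    feedback: str = ""  # error text handed back to the model on failure


# Hypothetical signatures: a generator maps a prompt to Rust source, and a
# check compares the C input with a Rust candidate.
Generator = Callable[[str], str]
Check = Callable[[str, str], CheckResult]


def translate_with_feedback(
    c_source: str,
    generate: Generator,
    checks: List[Check],
    max_iterations: int,
) -> Tuple[Optional[str], int]:
    """Return (accepted Rust code or None, number of generations used)."""
    prompt = "Translate this C code to Rust:\n" + c_source
    used = 0
    for used in range(1, max_iterations + 1):
        candidate = generate(prompt)
        failure = None
        for check in checks:  # e.g. compilability first, then equivalence
            result = check(c_source, candidate)
            if not result.ok:
                failure = result
                break
        if failure is None:
            return candidate, used
        # Re-prompt with the failed candidate and the checker's feedback.
        prompt = (
            "Translate this C code to Rust:\n" + c_source
            + "\nYour previous attempt:\n" + candidate
            + "\nIt failed a check:\n" + failure.feedback
        )
    return None, used


def fake_generate(prompt: str) -> str:
    # Stand-in for an LLM call: it only produces typed code after feedback.
    if "failed a check" in prompt:
        return "fn add(a: i32, b: i32) -> i32 { a + b }"
    return "fn add(a, b) { a + b }"


def fake_compiles(c_source: str, rust: str) -> CheckResult:
    # Stand-in for invoking a compiler; a real check would run one.
    ok = "-> i32" in rust
    return CheckResult(ok, "" if ok else "error: missing type annotations")
```

Calling `translate_with_feedback(c_code, fake_generate, [fake_compiles], 3)` runs the loop with these stubs. Setting `max_iterations` to 1 corresponds to a system without a feedback loop, which is the baseline the abstract compares against.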

Paper Structure

This paper contains 22 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of our C-to-Rust translation system and the three major experiments we perform in this paper: ➀ impact of internal feedback loops (symbolized by $i$) and multiple runs (symbolized by $k$) on performance, ➁ effect of LLM selection on performance, and ➂ robustness under code perturbations.
  • Figure 2: pass@$k$ performance versus the sum of generated tokens: The curves show how solution success rates (from pass@$1$ to pass@$5$) improve with larger token budgets as the maximum number of iterations of the inner feedback loop increases.
  • Figure 3: pass@$5$ performance: Comparing internal benchmark files versus external benchmark files.
  • Figure 4: Detailed analysis of which check fails the translation tasks for each iteration of the feedback loop.
  • Figure 5: pass@$k$ performance across three models for $k$ up to 20 runs. Different styles are used for varying iteration counts.
  • ...and 4 more figures
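
Several captions above report pass@$k$. This section does not say how the paper computes it, so the following is only a sketch of one commonly used unbiased estimator: given $n$ independent runs of which $c$ pass all checks, pass@$k$ is estimated as $1 - \binom{n-c}{k} / \binom{n}{k}$. The function name `pass_at_k` and its parameter names are chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k runs succeeds.

    n: total number of runs sampled for a task
    c: number of those runs that passed all checks
    k: number of runs the user is assumed to draw (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing only failing runs;
    # it is 0 when fewer than k runs failed.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a benchmark, the per-task estimates would typically be averaged over all translation tasks.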