Unsupervised Evaluation of Code LLMs with Round-Trip Correctness

Miltiadis Allamanis; Sheena Panthaplackel; Pengcheng Yin

TL;DR

This work introduces Round-Trip Correctness (RTC) as an unsupervised framework to evaluate code LLMs beyond narrow, human-curated benchmarks. By pairing forward and backward generation between code and natural language and measuring semantic equivalence with a similarity oracle, RTC enables scalable assessment across diverse real-world domains and tasks, including code synthesis and editing. The authors instantiate RTC as SynthesisRtc and EditingRtc, demonstrate strong correlation with existing benchmarks on standard datasets, and reveal significant cross-domain variability when evaluating across many open-source projects and editing scenarios. The findings suggest RTC can complement traditional benchmarks to provide broader, domain-rich insights into code-generation capabilities, while highlighting the need for careful choice of similarity metrics and qualitative analysis to interpret results.

Abstract

To evaluate code large language models (LLMs), research has relied on a few small, manually curated benchmarks, such as HumanEval and MBPP, which cover only a narrow part of real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows Code LLM evaluation on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code using natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check whether this round trip leads to code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC strongly correlates with model performance on existing narrow-domain code synthesis benchmarks while allowing us to expand to a much broader set of domains and tasks, an expansion that was not previously possible without costly human annotations.
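The round trip described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`round_trip_correctness`, `passes_same_tests`), the lookup tables standing in for a model, the probe inputs, and the choice to average scores uniformly over all sampled round trips are all assumptions made for this example. The similarity oracle here simply compares the behaviour of two snippets on a handful of inputs.

```python
from statistics import mean
from typing import Callable, Sequence


def round_trip_correctness(
    examples: Sequence[str],
    forward: Callable[[str], Sequence[str]],
    backward: Callable[[str], Sequence[str]],
    sim: Callable[[str, str], float],
) -> float:
    """Average sim(original, reconstruction) over all sampled round trips."""
    per_example = []
    for x in examples:
        scores = [
            sim(x, x_hat)
            for description in forward(x)       # forward pass: code -> description
            for x_hat in backward(description)  # backward pass: description -> code
        ]
        if scores:
            per_example.append(mean(scores))
    return mean(per_example) if per_example else 0.0


def passes_same_tests(original: str, candidate: str) -> float:
    """Hypothetical oracle: 1.0 if both snippets define `f` and agree on all probes."""
    probes = [0, 1, 2, 5]

    def load(source: str):
        namespace: dict = {}
        exec(source, namespace)  # toy only: never exec untrusted model output unsandboxed
        return namespace["f"]

    try:
        f_orig, f_cand = load(original), load(candidate)
        return float(all(f_orig(p) == f_cand(p) for p in probes))
    except Exception:
        return 0.0


# Toy stand-ins for a code LLM (assumed data, not model output).
ORIGINAL = "def f(n):\n    return n * 2\n"
TOY_DESCRIPTIONS = {ORIGINAL: ["double the input", "add one to the input"]}
TOY_CODE = {
    "double the input": ["def f(n):\n    return n + n\n"],
    "add one to the input": ["def f(n):\n    return n + 1\n"],
}

score = round_trip_correctness(
    [ORIGINAL],
    forward=lambda code: TOY_DESCRIPTIONS[code],
    backward=lambda description: TOY_CODE[description],
    sim=passes_same_tests,
)
print(score)  # 0.5: one of the two round trips preserves behaviour
```

In a real evaluation the two lambdas would be replaced by sampled model calls, and the oracle by whatever similarity measure is appropriate for the task, such as running a project's existing unit tests.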

Paper Structure (24 sections, 3 equations, 6 figures, 3 tables)

Introduction
Round-Trip Correctness
Background
RTC for Model Evaluation
Measuring the forward lift
Limitations
RTC for Code
Round-trip Code Synthesis (SynthesisRtc)
Round-trip Code Editing (EditingRtc)
Evaluation
Experimental Setup
Does RTC correlate with existing metrics on narrow-domain benchmarks?
Evaluating Code-to-Description
Sensitivity of RTC
Do LLMs perform similarly across domains?
...and 9 more sections

Figures (6)

Figure 1: Round-trip correctness (RTC) for Code Synthesis: An LLM is asked to describe the highlighted code (left) within the context of the file. Subsequently, it is asked to implement the relevant code within the code context given the description it previously generated (right).
Figure 2: Round-trip correctness (RTC) for Code Synthesis across 58 open-source projects of diverse domains for Gemini Pro and Nano 2: $\textsf{RTC}_{\text{pass}}$ varies widely across projects/domains, something that common code synthesis benchmarks fail to capture.
Figure 3: SynthesisRtc example from https://github.com/more-itertools/more-itertools/blob/8e4b048d9962e655d7b7f9bfc5e8b0675d70697f/more_itertools/more.py#L4278 from Gemini Pro. Code slightly reformatted/abbreviated for space.
Figure 4: A CodeReviewer example with Gemini Pro predictions: 3 sampled descriptions in the forward pass as well as their corresponding predicted edits in the backward pass. We additionally include predictions from the backward pass when the provided description is instead the PR comment or baseline description. Examples have minor edits/re-format due to space constraints.
Figure 5: Lift
...and 1 more figure
