Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code

Shahin Honarvar; Mark van der Wilk; Alastair Donaldson

TL;DR

This work addresses the problem of reliably evaluating the correctness and robustness of instruction-tuned LLMs for code by introducing Turbulence, a neighbourhood-based benchmark. Turbulence pairs parameterised question templates with test oracles to generate families of closely related programming problems, which exposes discontinuities where a model solves most instances in a family but fails for specific parameter instantiations. The study defines CorrSc as the mean correctness across multiple attempts and instantiates 60 templates with 100 valuations each, giving 6,000 question instances. These are put to five state-of-the-art models at two temperature settings, with repeated attempts per question, for a total of 300,000 responses. GPT-4 generally outperforms the other models, but every model exhibits robustness gaps; a lower temperature makes outputs more deterministic and reduces the number of partially solved neighbourhoods. The benchmark and accompanying data provide a systematic, reproducible framework for analysing code-generation robustness and for guiding future improvements and fine-tuning.
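The scale figures quoted in this summary can be cross-checked with simple arithmetic. The sketch below assumes that responses are spread evenly over question instances and model configurations, which the summary does not state explicitly; all variable names are illustrative.

```python
# Figures quoted in the summary.
templates = 60
valuations_per_template = 100
models = 5
temperatures = 2
total_responses = 300_000

# Each template is instantiated once per valuation.
question_instances = templates * valuations_per_template

# A configuration is a (model, temperature) pair.
configurations = models * temperatures

# Assumption: every instance is posed equally often to every configuration.
responses_per_instance, remainder = divmod(
    total_responses, question_instances * configurations
)

print(question_instances, configurations, responses_per_instance, remainder)
```

Under that even-spread assumption the totals divide exactly, which is consistent with each question being attempted a fixed number of times per configuration.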

Abstract

We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language $\textit{question templates}$, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated $\textit{test oracle}$ that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a $\textit{neighbourhood}$ of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including $\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$ questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting $\textit{robustness}$ issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
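To make the template, oracle and neighbourhood idea concrete, here is a minimal sketch. Everything in it is hypothetical: the question wording, the `oracle` function, and in particular `stand_in_model`, which replaces a real LLM call with a hand-written solution generator containing a deliberate bug. It is not the benchmark's actual implementation.

```python
from itertools import product
from typing import Callable, List

# Hypothetical question template; {a} and {b} are the parameters.
TEMPLATE = (
    "Write a function called 'sum_range' that takes a list of integers and "
    "returns the sum of the elements from index {a} to index {b}, inclusive."
)

def instantiate(a: int, b: int) -> str:
    """Produce one concrete question from the template."""
    return TEMPLATE.format(a=a, b=b)

def oracle(candidate: Callable[[List[int]], int], a: int, b: int) -> bool:
    """Hypothetical test oracle: compare the candidate with a reference
    solution on a few fixed inputs, treating any exception as a failure."""
    inputs = [list(range(20)), [5] * 20, [(-1) ** i * i for i in range(20)]]
    try:
        return all(candidate(xs) == sum(xs[a:b + 1]) for xs in inputs)
    except Exception:
        return False

def stand_in_model(a: int, b: int) -> Callable[[List[int]], int]:
    """Stand-in for an LLM response (assumption: no real model is called)."""
    start = a or 1  # deliberate bug: index 0 is treated as "unset"
    return lambda xs: sum(xs[start:b + 1])

# A neighbourhood: the same question asked for many parameter valuations.
neighbourhood = list(product(range(0, 10), (12, 15)))
results = {(a, b): oracle(stand_in_model(a, b), a, b) for a, b in neighbourhood}

failures = sorted(v for v, ok in results.items() if not ok)
pass_rate = 1 - len(failures) / len(results)
print(f"pass rate {pass_rate:.0%}, failing valuations: {failures}")
```

The stand-in solves 18 of the 20 instances and fails only where `a` is 0. That is the kind of anomaly the neighbourhood approach is designed to surface: a single instance in isolation would most likely have passed.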

Paper Structure (12 sections, 1 equation, 6 figures, 2 tables)

Introduction
Our Benchmarking Approach
The Turbulence Benchmark
Experimental Evaluation
Experimental Setup
Results Based on CorrSc
Results Based on Distinct Categories
Exploring Reasons for Failure
Threats to Validity
Related Work
Conclusions and Future Work
Data Availability Statement

Figures (6)

Figure 1: An example of a question template, test case template and model solution template, and an instantiation of each
Figure 2: Overview of our benchmarking approach
Figure 3: $\mathit{CorrSc}$ of question templates across the LLM configurations evaluated
Figure 4: Consistent failure, CodeLlama-7 ($t\!=\!0$)
Figure 5: Distribution of Turbulence question templates across result categories for the LLM configurations evaluated
...and 1 more figure

Theorems & Definitions (1)

Definition 1: Correctness Score
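The statement of Definition 1 is not reproduced on this page. A formalisation consistent with the TL;DR's description of CorrSc as the mean correctness across multiple attempts might read as follows. The symbols $q$, $n$, $r_i$ and $\mathit{correct}$ are notation assumed here, not taken from the paper.

```latex
\mathit{CorrSc}(q) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathit{correct}(r_i),
\qquad
\mathit{correct}(r_i) =
\begin{cases}
1 & \text{if the test oracle accepts response } r_i,\\
0 & \text{otherwise,}
\end{cases}
```

where $r_1, \dots, r_n$ are the $n$ responses obtained for question $q$. Under this reading the score lies in $[0, 1]$, with 1 meaning every attempt was judged correct.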
