Syntactic Robustness for LLM-based Code Generation

Laboni Sarker, Mara Downing, Achintya Desai, Tevfik Bultan

TL;DR

This paper formalizes syntactic robustness for LLM-based code generation and shows that semantically equivalent but syntactically different formulas can lead to non-equivalent code outputs. It defines a framework built on a code generator modeled as G: P → C, semantic equivalence of formulas, program equivalence, and a robustness degree measured relative to a reference code C^R_F, then evaluates it on univariate equations in linear, quadratic, trigonometric, and logarithmic forms. Using mutation rules M1–M5 and a reduction-based pre-processing workflow, the authors show that GPT-3.5-Turbo and GPT-4 are not inherently robust to such syntactic variations, but that robustness can be driven to 100% with prompt reduction. The work provides a prompting and testing pipeline, including differential testing against 1000 inputs and a constructed set of 627 variations, and offers practical guidance for reliable code generation in numerical tasks and for prompt-engineering strategies in safety-critical software contexts.
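The differential-testing idea mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the linear equation a*x + b = c, the hand-written stand-ins for generated code, the tolerance values, and the reading of the robustness degree as the fraction of candidate programs that agree with the reference on every sampled input. The paper's own definitions are not reproduced on this page and may differ.

```python
import math
import random


def reference_solver(a, b, c):
    """Hypothetical reference code: solve a*x + b = c for x (assumes a != 0)."""
    return (c - b) / a


def candidate_equivalent(a, b, c):
    """Stand-in for code generated from a mutated prompt; algebraically the same."""
    return c / a - b / a


def candidate_buggy(a, b, c):
    """Stand-in for a non-equivalent output (sign error on b)."""
    return (c + b) / a


def differentially_equivalent(reference, candidate, n_inputs=1000, seed=0):
    """Compare two programs on n_inputs random coefficient triples."""
    rng = random.Random(seed)
    for _ in range(n_inputs):
        a = rng.choice([-1, 1]) * rng.uniform(0.5, 10.0)  # keep a away from zero
        b = rng.uniform(-10.0, 10.0)
        c = rng.uniform(-10.0, 10.0)
        expected = reference(a, b, c)
        actual = candidate(a, b, c)
        if not math.isclose(expected, actual, rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True


def robustness_degree(reference, candidates, n_inputs=1000):
    """Assumed reading: share of candidates equivalent to the reference."""
    equivalent = sum(
        differentially_equivalent(reference, cand, n_inputs) for cand in candidates
    )
    return equivalent / len(candidates)


degree = robustness_degree(reference_solver, [candidate_equivalent, candidate_buggy])
print(degree)
```

Agreement on sampled inputs is evidence of equivalence rather than a proof, which is why a floating-point tolerance is used instead of exact comparison.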

Abstract

Rapid advances in the field of Large Language Models (LLMs) have made LLM-based code generation an important area for investigation. An LLM-based code generator takes a prompt as input and produces code that implements the requirements specified in the prompt. Many software requirements include mathematical formulas that specify the expected behavior of the code to be generated. Given a code generation prompt that includes a mathematical formula, a reasonable expectation is that, if the formula is syntactically modified without changing its semantics, the generated code for the modified prompt should be semantically equivalent. We formalize this concept as syntactic robustness and investigate the syntactic robustness of GPT-3.5-Turbo and GPT-4 as code generators. To test syntactic robustness, we generate syntactically different but semantically equivalent versions of prompts using a set of mutators that only modify mathematical formulas in prompts. In this paper, we focus on prompts that ask for code that generates solutions to variables in an equation, when given coefficients of the equation as input. Our experimental evaluation demonstrates that GPT-3.5-Turbo and GPT-4 are not syntactically robust for this type of prompt. To improve syntactic robustness, we define a set of reductions that transform the formulas to a simplified form and use these reductions as a pre-processing step. Our experimental results indicate that the syntactic robustness of LLM-based code generation can be improved using our approach.
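To make the roles of mutators and reductions concrete, here is a small sketch under stated assumptions. The term representation, the `move_term` mutator, and the `reduce_equation` reduction are hypothetical and chosen only for illustration; the paper's actual mutation and reduction rules appear in its Figures 4 and 5, which are not reproduced on this page. The point it shows is that syntactically different variants of one equation can be mapped to a single simplified form before prompting.

```python
from collections import Counter


def move_term(equation, side, index):
    """Hypothetical mutator: move one term across '=' and flip its sign."""
    lhs, rhs = list(equation[0]), list(equation[1])
    src, dst = (lhs, rhs) if side == "lhs" else (rhs, lhs)
    sign, name = src.pop(index)
    dst.append((-sign, name))
    return (tuple(lhs), tuple(rhs))


def reduce_equation(equation):
    """Hypothetical reduction: collect all terms on the left, combine, sort."""
    totals = Counter()
    for sign, name in equation[0]:
        totals[name] += sign
    for sign, name in equation[1]:
        totals[name] -= sign
    # Canonical form: sorted (term, coefficient) pairs, read as "sum = 0".
    return tuple(sorted((name, coeff) for name, coeff in totals.items() if coeff != 0))


def render(equation):
    """Print an equation as text, e.g. 'a*x + b = c'."""
    def side(terms):
        if not terms:
            return "0"
        text = " ".join(("+ " if s > 0 else "- ") + name for s, name in terms)
        return text[2:] if text.startswith("+ ") else text
    return f"{side(equation[0])} = {side(equation[1])}"


# a*x + b = c, with each term stored as (sign, text).
original = (((1, "a*x"), (1, "b")), ((1, "c"),))
variant1 = move_term(original, "lhs", 1)   # move b to the right-hand side
variant2 = move_term(variant1, "rhs", 0)   # then move c to the left-hand side

for eq in (original, variant1, variant2):
    print(render(eq), "->", reduce_equation(eq))
```

All three variants reduce to the same tuple, so a prompt built from the reduced form would be identical regardless of which variant the user wrote.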

Paper Structure

This paper contains 25 sections, 4 equations, and 12 figures.

Figures (12)

  • Figure 1: Prompt Example 1 and the code generated by the LLM-based code generator.
  • Figure 2: Prompt Example 2 and the code generated by the LLM-based code generator.
  • Figure 3: Our context-free grammar for univariate polynomial, trigonometric and logarithmic equations.
  • Figure 4: Mutation rules for equations.
  • Figure 5: Reduction rules for equations.
  • ...and 7 more figures

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Definition 9
  • Definition 10
  • ...and 1 more