NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness

Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, Aditya Kanade

TL;DR

NoFunEval demonstrates that state-of-the-art code LMs struggle to satisfy non-functional requirements and to comprehend how such requirements relate to code semantics. By introducing NoFunEdit, NoFunClassify, and HumanEvalClassify, along with the Coding Concepts prompting strategy, the paper reveals substantial gaps between a model's ability to generate or edit functionally correct code and its ability to reason about latency, resource usage, maintainability, and security. The study analyzes 27 code LMs across multiple languages, showing that non-functional tasks are notably harder than functional ones, and that classification-based comprehension lags behind edit-generation capabilities. DiffBLEU proves to be a practical, lightweight proxy metric that correlates well with more expensive execution- and static-analysis-based measures, underscoring the benchmark’s utility for scalable evaluation and future improvements in code-language modeling. The work calls for continuous benchmark evolution and richer prompts to bridge the gap between what code LMs can produce and what practitioners require in real-world software engineering.

Abstract

Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust the code LMs more if the LMs demonstrate robust understanding of such requirements. We propose a new benchmark NoFunEval to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements. We propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate the domain knowledge to the LMs. We conduct an extensive evaluation of 27 code LMs. Our finding is that LMs generally falter when tested on our benchmark, hinting at fundamental blindspots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling in question the depth of their comprehension and the source of their success in generating functionally-correct code in the first place. We release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.

Paper Structure (23 sections, 12 figures, 7 tables)

Figures (12)

  • Figure 1: (a) NoFunEval contributes edit and comprehension tasks, NoFunEdit and NoFunClassify, for non-functional requirements, and complements HumanEval and HumanEvalFix with a comprehension task, HumanEvalClassify. (b)–(c): Performance of LMs on NoFunEval, HumanEval, and HumanEvalFix benchmarks (metrics, full results in § \ref{sec:results}). For consistency, in plot (c), we include only those instances with a binary evaluation oracle.
  • Figure 2: Overview of the NoFunEval benchmark. NoFunEval consists of three subtasks, NoFunEdit, NoFunClassify, and HumanEvalClassify, spanning multiple programming languages. NoFunEdit (§ \ref{sec:nofunedit}) involves editing a given source code as per a user-specified non-functional requirement (e.g., improving memory usage). We design four prompting techniques (§ \ref{sec:prompting}) for eliciting LMs to perform the required editing, ranging from minimal task-related information ("Base") to guiding with high-level hints ("Coding Concepts"). NoFunClassify (§ \ref{sec:nofunclassify}) involves distinguishing between two code snippets based on a non-functional property (e.g., selecting the code with lower memory utilization). We construct it by reformulating problems in NoFunEdit. Similarly, we construct HumanEvalClassify (§ \ref{sec:he_classify}) by reformulating HumanEvalFix (Muennighoff et al., 2023), which involves distinguishing two code snippets based on their functional correctness (i.e., bug detection).
  • Figure A.1: An example Base Prompt for improving bandwidth usage in code for an Android application (§ \ref{sec:prompting}).
  • Figure A.2: An example 1-Shot / Chain-of-Thought prompt template for fixing a maintainability issue ("Unguarded next in generator") as flagged by CodeQL. The underlined texts are instantiated based on the example. The shaded text denotes the reasoning we include for the corresponding Chain-of-Thought prompt (§ \ref{sec:prompting}).
  • Figure A.3: An example CoCo prompt template for fixing a security issue ("Deserialization of Untrusted Data") as flagged by CodeQL (§ \ref{sec:prompting}). The underlined texts are instantiated based on the example.
  • ...and 7 more figures