NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness
Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, Aditya Kanade
TL;DR
NoFunEval demonstrates that state-of-the-art code LMs struggle to satisfy non-functional requirements and to comprehend how such requirements relate to code semantics. By introducing NoFunEdit, NoFunClassify, and HumanEvalClassify, along with the Coding Concepts prompting strategy, the paper reveals substantial gaps between a model's ability to generate or edit functionally correct code and its ability to reason about latency, resource usage, maintainability, and security. The study analyzes 27 code LMs across multiple languages, showing that non-functional tasks are notably harder than functional ones, and that classification-based comprehension lags behind edit-generation capabilities. DiffBLEU proves to be a practical, lightweight proxy metric that correlates well with more expensive execution- and static-analysis-based measures, which makes the benchmark suitable for scalable evaluation. The work calls for continuous benchmark evolution and richer prompts to bridge the gap between what code LMs can produce and what practitioners require in real-world software engineering.
Abstract
Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust the code LMs more if the LMs demonstrated a robust understanding of such requirements. We propose a new benchmark, NoFunEval, to evaluate code LMs on non-functional requirements and on simple classification instances for both functional and non-functional requirements. We propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate domain knowledge to the LMs. We conduct an extensive evaluation of 27 code LMs. Our finding is that LMs generally falter when tested on our benchmark, hinting at fundamental blind spots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling into question the depth of their comprehension and the source of their success in generating functionally correct code in the first place. We release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.
