BaxBench: Can LLMs Generate Correct and Secure Backends?

Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev

TL;DR

BaxBench is a benchmark for LLM generation of deployment-ready backends. It combines 28 backend scenarios with 14 frameworks across 6 languages, yielding 392 tasks that are evaluated with OpenAPI-driven correctness tests and end-to-end security exploits. Even flagship models struggle to produce correct and secure backends: the best pass@1 is around 62%, sec_pass@1 peaks near 35% under the best conditions, and exploitable vulnerabilities are prevalent even among functionally correct solutions. The authors show that prompt design (notably security guidance) and test-time reasoning improve security performance, and that test-time agents can further improve outcomes but do not fully close the gap. BaxBench provides a modular, extensible framework for rigorous, framework- and language-agnostic evaluation of future LLMs on realistic backend generation tasks, guiding the development of safer, more autonomous software pipelines.

Abstract

Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third parties, making secure solutions that prevent deployment-time attacks imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around half of the correct programs generated by each LLM; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps towards autonomous and secure software development with LLMs.

Paper Structure

This paper contains 49 sections, 1 equation, 30 figures, 14 tables.

Figures (30)

  • Figure 1: Even flagship models struggle to generate correct and secure application backends, signifying that LLMs are not yet ready for deployment-ready coding automation.
  • Figure 2: Overview of the structure and execution process of BaxBench. The benchmark consists of $28$ scenarios describing backend applications and $14$ popular backend framework environments across $6$ programming languages. Combined, these result in $392$ challenging benchmark tasks. To evaluate an LLM, we prompt it with the scenario specification to generate a set of code files and assets that implement the scenario. We evaluate the correctness of those solutions using functional tests, and attempt to practically exploit the LLM code, targeting specific vulnerabilities.
  • Figure 3: Evaluation results of $11$ LLMs on the $392$ tasks of BaxBench. Full bars represent sec_pass@1, while full bars and shaded bars together show pass@1. Concerningly, around $50\%$ of the passing programs for each model are exploitable. While sec_pass@1 is significantly higher for models with a higher pass@1 score, even for the best model, OpenAI o3-mini, it only reaches $35\%$. As such, even flagship LLMs are not yet ready for automated development in production.
  • Figure 4: Impact of the generic and oracle-based security reminders on pass@1 and sec_pass@1.
  • Figure 5: Performance of OpenAI o1 across different frameworks on all prompt types. Frameworks requiring implementations across multiple files to launch an HTTP server are marked with an asterisk$^*$. The model struggles more with less popular programming languages and multi-file frameworks. Results on other models are included in \ref{appendix:model_performance_across_scenarios}.
  • ...and 25 more figures
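
The captions for Figures 2 and 3 refer to the task count and to the pass@1 and sec_pass@1 metrics. The sketch below illustrates how these quantities relate, assuming one generated solution per task, pass@1 as the fraction of tasks whose solution passes all functional tests, and sec_pass@1 as the fraction whose solution passes the tests and withstands every attempted exploit. The `TaskResult` type, the function names, and the toy data are hypothetical and are not taken from the BaxBench codebase.

```python
from dataclasses import dataclass
from typing import Sequence

# Task count from the Figure 2 caption: every scenario is paired with every framework.
NUM_SCENARIOS = 28
NUM_FRAMEWORKS = 14
NUM_TASKS = NUM_SCENARIOS * NUM_FRAMEWORKS  # 28 * 14 = 392


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task outcome for a single generated solution."""
    passed_tests: bool  # all functional tests passed
    exploited: bool     # at least one security exploit succeeded


def pass_at_1(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks whose solution is functionally correct."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.passed_tests for r in results) / len(results)


def sec_pass_at_1(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks whose solution is both correct and not exploited."""
    if not results:
        raise ValueError("results must not be empty")
    secure_and_correct = sum(r.passed_tests and not r.exploited for r in results)
    return secure_and_correct / len(results)


# Toy data (not real benchmark results): four tasks, two correct, one of those exploitable.
toy_results = [
    TaskResult(passed_tests=True, exploited=False),
    TaskResult(passed_tests=True, exploited=True),
    TaskResult(passed_tests=False, exploited=False),
    TaskResult(passed_tests=False, exploited=True),
]

print(NUM_TASKS)                   # 392
print(pass_at_1(toy_results))      # 0.5
print(sec_pass_at_1(toy_results))  # 0.25
```

In this toy case half of the passing programs are exploitable, so sec_pass@1 is half of pass@1, mirroring the roughly 50% ratio described in the Figure 3 caption.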