BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Pierre Chambon, Baptiste Roziere, Benoit Sagot, Gabriel Synnaeve

TL;DR

BigO(Bench) introduces a benchmark and accompanying framework to evaluate and improve large language models on coding tasks that must respect explicit time and space complexity constraints. It combines a dynamic complexity inference pipeline with a large dataset derived from Code Contests, enabling three evaluation tasks: predicting the complexity of existing solutions, generating code that meets a complexity constraint, and ranking generated solutions against human solutions of similar complexity using fitted curve coefficients. The results show strong code-generation capabilities in token-space reasoning models but limited robustness in complexity reasoning, even with fine-tuning, highlighting a gap between functional correctness and complexity-aware programming. The work provides data, tooling, and baselines to push development toward models that reason about and optimize for algorithmic efficiency in practice.

Abstract

We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses the gap in current evaluations that often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.

Paper Structure

This paper contains 26 sections, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: BigO(Bench) framework overview: Given a coding problem and human solutions, the framework evaluates language models on three key tasks: (1) predicting time-space complexities of existing solutions, (2) generating new code that meets specified complexity requirements, and (3) ranking solutions against human-written code with similar complexity profiles. The complexity framework automatically validates model outputs by computing runtime distributions and curve coefficients.
  • Figure 2: Outline of the dynamic complexity inference framework. The framework takes a code snippet and a single example of inputs to this code snippet. It then processes the code snippet and generates an extensive set of inputs based on the provided example: inputs are increased in size independently or interdependently, using several expansion methods such as identity or random expansion, among others. This forms a queue of synthetic inputs on which to execute the code snippet. The executions happen independently in sandboxes, where runtime and memory footprint are measured. Once all measurements are collected, the framework models the time and space dependencies of the code snippet on the different inputs. Using curve fitting, the time and space complexity of the code is computed for each input separately and then for all inputs together. The global time and space complexity over all inputs is returned.
  • Figure 3: Distribution of time-space complexity classes across the BigO(Bench) dataset of 3,105 coding problems. Each problem is included when at least one solution exists with that specific time-space complexity pair. Linear time O(n) represents 38% of solutions, while constant space O(1) accounts for 25%. The chart orders classes by computational efficiency, with less common classes grouped under "other". Problems for which the framework cannot infer a time complexity and/or a space complexity are not counted.
  • Figure 4: Failure rate analysis of the complexity inference framework. The top plot shows the overall distribution of framework failures across all problems. The bottom heatmap breaks down failure rates by input type and number of distinct inputs. Approximately 84% of problems have failure rates below 30%, demonstrating robust performance across most input configurations.
  • Figure 5: LLM results aggregated by time complexity class and by algorithmic notion for all models evaluated in BigO(Bench).
  • ...and 4 more figures
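
The measure-then-fit idea described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the candidate class list, the function names (`measure_runtime`, `fit_scale`, `infer_complexity`), the in-process timing (the paper uses sandboxes), and the single-coefficient least-squares model are all assumptions chosen for brevity. It handles one input dimension and time only.

```python
import math
import time

# Hypothetical candidate classes (label -> growth function g(n)).
# The real framework's class set is not specified here; this list is an assumption.
CANDIDATES = {
    "O(1)": lambda n: 1.0,
    "O(log n)": lambda n: math.log(n),
    "O(n)": lambda n: float(n),
    "O(n log n)": lambda n: n * math.log(n),
    "O(n^2)": lambda n: float(n) ** 2,
}


def measure_runtime(func, make_input, sizes, repeats=3):
    """Time func on inputs of growing size, keeping the fastest of `repeats` runs.

    Runs in-process for simplicity; no sandboxing is attempted in this sketch.
    """
    timings = []
    for n in sizes:
        data = make_input(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        timings.append(best)
    return timings


def fit_scale(sizes, measures, growth):
    """Least-squares fit of measures ~ coeff * growth(n).

    Returns (coeff, residual sum of squares). Sizes should be >= 2 so that
    log-based growth functions are strictly positive.
    """
    g = [growth(n) for n in sizes]
    coeff = sum(x * y for x, y in zip(g, measures)) / sum(x * x for x in g)
    rss = sum((y - coeff * x) ** 2 for x, y in zip(g, measures))
    return coeff, rss


def infer_complexity(sizes, measures):
    """Pick the candidate class with the smallest residual, plus its coefficient."""
    fits = {label: fit_scale(sizes, measures, g) for label, g in CANDIDATES.items()}
    best = min(fits, key=lambda label: fits[label][1])
    return best, fits[best][0]
```

A possible usage is `sizes = [2 ** k for k in range(8, 16)]` followed by `infer_complexity(sizes, measure_runtime(sorted, lambda n: list(range(n, 0, -1)), sizes))`. Real timings are noisy, so the returned label can vary between runs. The fitted coefficient gives a rough way to compare two solutions that fall in the same class, similar in spirit to the curve coefficients mentioned in the Figure 1 caption.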