InnoGym: Benchmarking the Innovation Potential of AI Agents

Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang

TL;DR

InnoGym addresses the limitation of correctness-centric benchmarks by introducing a principled framework to evaluate AI agents' innovation potential. It formalizes tasks as $(P,S,V,D)$ and defines two complementary metrics, $G$ (performance gain) and $N$ (novelty), to capture both effectiveness and methodological novelty. The framework is implemented via iBench (18 Improvable Tasks) and iGym (unified execution environment) to enable reproducible, long-horizon evaluations across domains. Experiments reveal strong novelty in some agents but limited robustness on complex tasks, underscoring that genuine innovation requires a balance of originality and reliable performance. Overall, InnoGym provides a cross-domain platform and methodology for advancing AI agents toward meaningful scientific and engineering innovation.

Abstract

LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
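As a rough illustration of the two metrics described above, the following Python sketch computes a performance gain and a novelty score for one candidate solution. The exact definitions are not given in this summary, so everything here is an assumption: gain is taken as the candidate's score minus the best known score (higher is better), novelty is taken as the distance to the nearest known solution, and `jaccard_distance` is a hypothetical stand-in for whatever dissimilarity measure the framework actually uses.

```python
from typing import Callable, Sequence


def performance_gain(score: float, known_scores: Sequence[float]) -> float:
    """Assumed definition: improvement over the best known score (higher is better)."""
    return score - max(known_scores)


def jaccard_distance(a: str, b: str) -> float:
    """Hypothetical dissimilarity: 1 - Jaccard overlap of whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty descriptions are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def novelty(
    candidate: str,
    known: Sequence[str],
    distance: Callable[[str, str], float] = jaccard_distance,
) -> float:
    """Assumed definition: distance from the candidate to its nearest known solution."""
    return min(distance(candidate, k) for k in known)


# Toy data (invented for illustration): method descriptions and their scores.
known_methods = ["gradient boosting with feature engineering", "random forest baseline"]
known_scores = [0.80, 0.70]

candidate_method = "graph neural network with feature engineering"
candidate_score = 0.85

print("gain:", performance_gain(candidate_score, known_scores))
print("novelty:", novelty(candidate_method, known_methods))
```

Keeping the two functions separate mirrors the point made in the abstract: a candidate can score high on novelty while its gain is small or negative, so the two values are reported side by side rather than merged into one number.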

Paper Structure

This paper contains 84 sections, 8 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: An illustration of our definition framework. (a) Core evaluation metrics. Innovation is evaluated along two dimensions: Performance ($V$) and Novelty ($N$). The colored shapes represent different candidate solutions, while the radius of the background concentric circles corresponds to the magnitude of performance $V(s)$ (larger radius indicates higher performance). (b) The solution space is partitioned by feasibility ($C(s)$) and prior knowledge. Feasible solutions (i.e., $C(s)=1$) are candidates for evaluation. (c–e) Categorization of three innovative tasks based on the spatial distribution of solutions relative to the knowledge boundary.
  • Figure 2: Dataset curation overview. We collect 197 tasks from public competitions, filter by resource and evaluator availability, and standardize scoring (executability, correctness, absolute metrics). After augmentation with validators, task specifications, solutions, and environments, the benchmark yields 18 balanced and diverse tasks across domains and hardware.
  • Figure 3: Overview of evaluation pipeline.
  • Figure 4: The architecture of iGym.
  • Figure 5: An illustration of the solution development process. (a) Solution Space Tree for Development: each node represents a candidate solution, where the Roman numeral denotes the iteration order, the first value indicates performance, and the underlined value denotes novelty. (b) Vector-Space Representation of the Solution Development Process: a complex-plane mapping that jointly encodes performance gain (magnitude) and normalized novelty (angle), providing a richer interpretation of the development trajectory.
  • ...and 2 more figures
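
The complex-plane mapping mentioned in the Figure 5(b) caption can be sketched in a few lines of Python. The caption only states that performance gain is encoded as magnitude and normalized novelty as angle; the specific choices below (novelty in [0, 1] mapped linearly onto an angle in [0, π/2], non-negative gain only, and the function name `to_complex`) are assumptions made for illustration and may differ from the paper's construction.

```python
import cmath
import math


def to_complex(gain: float, normalized_novelty: float) -> complex:
    """Map (gain, novelty) to a point in the complex plane.

    Assumptions (not taken from the source): gain is non-negative and
    novelty lies in [0, 1], mapped linearly to an angle in [0, pi/2].
    """
    if gain < 0:
        raise ValueError("this sketch only handles non-negative gain")
    if not 0.0 <= normalized_novelty <= 1.0:
        raise ValueError("normalized novelty must lie in [0, 1]")
    angle = normalized_novelty * math.pi / 2
    # cmath.rect builds a complex number from polar coordinates (r, phi).
    return cmath.rect(gain, angle)


# A hypothetical three-step trajectory of (gain, normalized novelty) pairs.
trajectory = [(0.5, 0.1), (1.0, 0.4), (1.2, 0.9)]
for gain, nov in trajectory:
    z = to_complex(gain, nov)
    print(f"gain={gain}, novelty={nov} -> magnitude={abs(z):.3f}, angle={cmath.phase(z):.3f} rad")
```

Under these assumptions, a solution with low novelty sits near the real axis and a highly novel one rotates toward the imaginary axis, while distance from the origin tracks how much it improves on known results.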