Evaluating Large Language Models in Code Generation: INFINITE Methodology for Defining the Inference Index
Nicholas Christakis, Dimitris Drikakis
TL;DR
The paper presents INFINITE, a framework for evaluating LLM-driven code generation that jointly measures efficiency, consistency, and accuracy through a single Inference Index (InI). The authors apply it to GPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3), each asked to generate Python code for LSTM-based forecasting of meteorological variables. GPT often achieves the highest InI, with faster and more consistent performance; OAI3 closely rivals GPT, while OAI1 lags behind in consistency. All AI-generated solutions yield predictions similar to those of a manually developed LSTM-H baseline, indicating that well-prompted and iteratively refined AI coding can reach expert-level results. The INFINITE framework thus offers a practical, multi-faceted tool for assessing AI-assisted coding in scientific contexts, with potential extensions to broader tasks and models.
Abstract
This study introduces a new methodology for an Inference Index (InI), called the INFerence INdex In Testing model Effectiveness (INFINITE) methodology, which aims to evaluate the performance of Large Language Models (LLMs) in code generation tasks. The InI provides a comprehensive assessment built on three key components: efficiency, consistency, and accuracy. It captures time-based efficiency, response quality, and the stability of model outputs, giving a fuller picture of LLM performance than traditional accuracy metrics alone. We applied this methodology to compare OpenAI's GPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3) in generating Python code for a Long Short-Term Memory (LSTM) model that forecasts meteorological variables such as temperature, relative humidity, and wind velocity. Our findings demonstrate that GPT outperforms OAI1 and performs comparably to OAI3 in accuracy and workflow efficiency. The study also shows that, with effective prompting and refinement, LLM-assisted code generation can produce results similar to those of expert-designed models. GPT's performance advantage highlights the benefits of widespread use and user feedback.
