Evaluating Large Language Models in Code Generation: INFINITE Methodology for Defining the Inference Index

Nicholas Christakis, Dimitris Drikakis

TL;DR

The paper presents INFINITE, a holistic framework for evaluating LLM-driven code generation by jointly measuring Efficiency, Consistency, and Accuracy through the Inference Index (InI). The framework is applied to GPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3) on the task of generating Python code for LSTM-based forecasting of meteorological variables. GPT often achieves the highest InI, with faster and more consistent performance; OAI3 closely rivals GPT, while OAI1 lags behind in consistency. All AI-generated solutions yield predictions similar to a manually developed LSTM-H baseline, demonstrating that well-prompted and iteratively refined AI coding can reach expert-level results. The INFINITE framework thus provides a practical, multi-faceted tool for real-world deployment of AI-assisted coding in scientific contexts, with potential extensions to broader tasks and models.

Abstract

This study introduces a new methodology for an Inference Index (InI), called INFerence INdex In Testing model Effectiveness methodology (INFINITE), aiming to evaluate the performance of Large Language Models (LLMs) in code generation tasks. The InI provides a comprehensive assessment focusing on three key components: efficiency, consistency, and accuracy. This approach encapsulates time-based efficiency, response quality, and the stability of model outputs, offering a thorough understanding of LLM performance beyond traditional accuracy metrics. We applied this methodology to compare OpenAI's GPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3) in generating Python code for a Long Short-Term Memory (LSTM) model to forecast meteorological variables such as temperature, relative humidity, and wind velocity. Our findings demonstrate that GPT outperforms OAI1 and performs comparably to OAI3 regarding accuracy and workflow efficiency. The study reveals that LLM-assisted code generation can produce results similar to expert-designed models with effective prompting and refinement. GPT's performance advantage highlights the benefits of widespread use and user feedback.

Paper Structure

This paper contains 10 sections, 7 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Major components impacting inference and their corresponding parameters that may be recorded while executing a code-generating task.
  • Figure 2: The wind velocity field over Crete, as predicted by WRF for 00:00 GMT on 27 October 2019, is illustrated using both colors (indicating magnitude) and arrows that represent wind direction and strength. The shaft of the arrow points in the direction of the wind, while the barbs denote magnitude (short for 5 knots and long for 10 knots) and are positioned on the side of the shaft from which the wind originates.
  • Figure 3: Temperature predictions from all three models and comparisons with ground truth and LSTM-H predictions. LSTM_H is the model developed by the authors, LSTM_GPT is the GPT-generated model, LSTM_OAI1 is the OAI1-generated model, and LSTM_OAI3 is the OAI3-generated model. The top graph represents the entire testing dataset. The two bottom graphs focus on specific time intervals, namely between 100 and 200 ten-minute intervals (bottom left) and 4100 and 4200 ten-minute intervals (bottom right). This is done to better visualize the difference between the various predictions.
  • Figure 4: Relative humidity predictions of all three models and comparisons with ground truth and LSTM-H predictions. LSTM_H is the model developed by the authors, LSTM_GPT is the GPT-generated model, LSTM_OAI1 is the OAI1-generated model, and LSTM_OAI3 is the OAI3-generated model. The top graph is for the whole testing data set. The two bottom graphs focus on specific time intervals, namely between 100 and 200 10-minute intervals (bottom left) and 4100 and 4200 10-minute intervals (bottom right). This is done to better visualize the difference between the different predictions.
  • Figure 5: Wind speed predictions of all three models and comparisons with ground truth and LSTM-H predictions. LSTM_H is the model developed by the authors, LSTM_GPT is the GPT-generated model, LSTM_OAI1 is the OAI1-generated model, and LSTM_OAI3 is the OAI3-generated model. The top graph is for the whole testing data set. The two bottom graphs focus on specific time intervals, namely between 100 and 200 10-minute intervals (bottom left) and 4100 and 4200 10-minute intervals (bottom right). This is done to better visualize the difference between the different predictions.
  • ...and 1 more figure