Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation

Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, Merieme Bouhandi, Walid Dahhane, El Hassane Ettifouri

TL;DR

This work investigates the viability of CPU-friendly language models for Python code generation, introducing a 60-problem dataset with three difficulty levels and a Chain-of-Thought prompting strategy to improve reasoning and code quality. It employs a semi-manual evaluation framework within the llama.cpp ecosystem to compare CPU-quantized models against large closed models, revealing that certain open-source CPU models can approach or even match larger systems on code tasks, albeit with formatting and prompt-adherence tradeoffs. The study also benchmarks on HumanEval and EvalPlus to contextualize performance across datasets, and discusses the practical implications of CPU-based code generation, including resource usage and inference time. Overall, the work demonstrates the growing feasibility of running sophisticated code-generation models on standard hardware, offers a public dataset and prompts, and outlines future directions for broader, cross-language evaluation and automated prompt optimization.

Abstract

Large Language Models (LLMs) have become a popular choice for many Natural Language Processing (NLP) tasks due to their versatility and ability to produce high-quality results. Specifically, they are increasingly used for automatic code generation to help developers tackle repetitive coding tasks. However, LLMs' substantial computational and memory requirements often make them inaccessible to users with limited resources. This paper focuses on very low-cost models which offer a more accessible alternative to resource-intensive LLMs. We notably: (1) propose a thorough semi-manual evaluation of their performance in generating Python code, (2) introduce a Chain-of-Thought (CoT) prompting strategy to improve model reasoning and code quality, and (3) propose a new dataset of 60 programming problems, with varied difficulty levels, designed to extend existing benchmarks like HumanEval and EvalPlus. Our findings show that some low-cost compatible models achieve competitive results compared to larger models like ChatGPT despite using significantly fewer resources. We will make our dataset and prompts publicly available to support further research.

Paper Structure (37 sections, 1 equation, 19 figures, 4 tables)

Figures (19)

  • Figure 1: Overview of Python code generation with our system. The input consists of a $problem$, $variables$, and $options$. We feed it to the language model along with an engineered prompt to guide the code generation. The language model generates the equivalent Python code.
  • Figure 2: Hierarchical organization of all models (CPU, GPU, and Chatbots) examined in the experiments conducted in this paper.
  • Figure 3: Examples of different programming languages.
  • Figure 4: Example solution for the easy problem from Figure \ref{fig:exampless}.
  • Figure 5: Proposed prompt to evaluate our results.
  • ...and 14 more figures
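
The input structure described in the Figure 1 caption (a problem, its variables, and its options, combined with an engineered prompt) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Task` class, the `build_prompt` function, and the prompt wording are hypothetical and are not the authors' actual prompt or code.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical container for the three inputs named in Figure 1."""
    problem: str
    variables: dict[str, str] = field(default_factory=dict)
    options: list[str] = field(default_factory=list)


def build_prompt(task: Task) -> str:
    """Assemble a step-by-step style prompt from a task (illustrative wording only)."""
    lines = [
        "You are asked to write Python code.",
        "Reason step by step, then give the final code.",
        "",
        f"Problem: {task.problem}",
    ]
    # Sections are added only when the task actually provides them.
    if task.variables:
        lines.append("Variables:")
        lines.extend(f"- {name}: {desc}" for name, desc in task.variables.items())
    if task.options:
        lines.append("Options:")
        lines.extend(f"- {opt}" for opt in task.options)
    return "\n".join(lines)


example = Task(
    problem="Sum two numbers",
    variables={"a": "first integer", "b": "second integer"},
    options=["return an int"],
)
print(build_prompt(example))
```

The resulting string would then be passed to whichever model is under evaluation; that step is omitted here because the summary does not specify the inference call.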