AICoderEval: Improving AI Domain Code Generation of Large Language Models

Yinghui Xia, Yuyan Chen, Tianyu Shi, Jun Wang, Jinsong Yang

TL;DR

This work tackles the gap in evaluating AI-domain code generation by large language models. It introduces AICoderEval, a domain-aware benchmark, and CoderGen, an agent-based framework that generates and refines library-specific code, culminating in the AICoder model refined via LoRA. Empirical results show significant improvements in pass@1 metrics and state-of-the-art performance on the benchmark, driven by error-traceback analysis, iterative regeneration, and domain-focused fine-tuning. The framework and dataset offer practical tools to assess and advance models for real-world AI software development involving HuggingFace, PyTorch, and TensorFlow libraries.

Abstract

Automated code generation is a pivotal capability of large language models (LLMs). However, assessing this capability in real-world scenarios remains challenging. Previous methods focus more on low-level code generation, such as model loading, than on generating high-level code for real-world tasks, such as image-to-text and text classification, across various domains. Therefore, we construct AICoderEval, a dataset focused on real-world tasks in various domains based on HuggingFace, PyTorch, and TensorFlow, along with comprehensive metrics for evaluating and enhancing LLMs' task-specific code generation capability. AICoderEval contains test cases and complete programs for automated evaluation of these tasks, covering domains such as natural language processing, computer vision, and multimodal learning. To facilitate research in this area, we open-source the AICoderEval dataset at https://huggingface.co/datasets/vixuowis/AICoderEval. We then propose CoderGen, an agent-based framework, to help LLMs generate code for the real-world tasks in AICoderEval. Moreover, we train a more powerful task-specific code generation model, named AICoder, obtained by fine-tuning llama-3 on AICoderEval. Our experiments demonstrate the effectiveness of CoderGen in improving LLMs' task-specific code generation capability (by 12.00% on pass@1 for the original model and 9.50% on pass@1 for the ReAct Agent). AICoder also outperforms current code generation LLMs, indicating the high quality of the AICoderEval benchmark.
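
The abstract reports results as pass@1 but does not define the metric here. As a minimal sketch, assuming the paper follows the widely used unbiased pass@k estimator (draw n samples per task, count the c samples that pass the tests, and estimate 1 - C(n-c, k)/C(n, k)), the computation could look like the following. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and do not come from the paper.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed.

    Assumption: this is the standard unbiased estimator; the paper's exact
    evaluation protocol is not given in this section.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each item is (n_samples, n_passed)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)
```

With a single sample per task (n = 1, k = 1), the estimator reduces to the fraction of tasks whose one generated program passes its tests, which is the usual reading of pass@1.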

Paper Structure (17 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The AICoder model produced by our CoderGen framework can program for domain-specific tasks and select the appropriate libraries to invoke. Part A shows the output generated by codellama-7b-python, which incorrectly invokes a library through the pipeline method. In contrast, Part B shows the output produced by AICoder, which selects and calls the appropriate library to fulfill the requirements.
  • Figure 2: CoderGen: A Domain-Specific Code Generation Architecture. The architecture comprises two components. On the left, AICoderEval data is produced by analyzing library documentation together with the provided document data (model meta-information). This data, which includes testable programs, is then validated in an execution environment and used to train an LLM (referred to as AICoder in the rest of the paper). On the right, an LLM-based agent directs the code generation process. Real executable environments send feedback to both the agent and the LLM, helping to refine the generated code.
  • Figure 3: Example of error traceback analysis.
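
The execution-feedback loop described in the Figure 2 and Figure 3 captions can be sketched roughly as follows. This is an illustration under stated assumptions, not the CoderGen implementation: `run_candidate`, `refine`, and `toy_generate` are hypothetical names, the generator is a stand-in for an LLM call, and the only feedback passed back is the captured stderr text, which holds the Python traceback when a candidate fails.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_candidate(code: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Run a candidate program in a fresh interpreter.

    Returns (passed, stderr); stderr carries the traceback on failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "TimeoutExpired: candidate exceeded the time limit"
    return proc.returncode == 0, proc.stderr


def refine(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Generate, execute, and regenerate with traceback feedback.

    Returns (last_code, passed, rounds_used).
    """
    feedback: Optional[str] = None
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(task, feedback)
        passed, stderr = run_candidate(code)
        if passed:
            return code, True, round_no
        feedback = stderr  # hand the traceback to the next attempt
    return code, False, max_rounds


def toy_generate(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: buggy first, fixed after feedback."""
    body = "a - b" if feedback is None else "a + b"
    return f"def add(a, b):\n    return {body}\n\nassert add(2, 3) == 5\n"
```

Calling `refine("write add", toy_generate)` fails on the first round, passes the resulting traceback to the second call, and succeeds on the second round. A real setup would replace `toy_generate` with a model call and would likely run candidates in an isolated sandbox rather than a plain subprocess.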