Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs

Shuai Wang, Liang Ding, Yibing Zhan, Yong Luo, Zheng He, Dapeng Tao

TL;DR

This work addresses data-free code generation by introducing a metamemory-inspired framework, $\text{M}^{2}\text{WF}$, that enables LLMs to autonomously Recall relevant problems, Evaluate recall quality, Plan an implementation, and Guide the final Python3 solution for a given task. By generating and internally validating synthetic exemplars rather than relying on curated data, the method improves coding performance across both open-source and closed-source LLMs on benchmarks such as HumanEval, HumanEval+, and StudentEval. Extensive experiments show consistent gains over normal prompting and other baselines, demonstrating the approach's versatility, robustness, and applicability to multilingual coding tasks. The framework offers a scalable, data-free pathway to enhance software development workflows with LLMs, albeit with limitations related to API behavior, formatting constraints, and increased token usage.

Abstract

Automated code generation using large language models (LLMs) has gained attention due to its efficiency and adaptability. However, real-world coding tasks and benchmarks such as HumanEval and StudentEval often lack dedicated training datasets, which challenges existing few-shot prompting approaches that rely on reference examples. Inspired by human metamemory, a cognitive process involving recall and evaluation, we present a novel framework (namely $\text{M}^{2}\text{WF}$) for improving LLMs' one-time code generation. This approach enables LLMs to autonomously generate, evaluate, and utilize synthetic examples to enhance reliability and performance. Unlike prior methods, it minimizes dependency on curated data and adapts flexibly to various coding scenarios. Our experiments demonstrate significant improvements on coding benchmarks, offering a scalable and robust solution for data-free environments. The code and framework will be publicly available on GitHub and HuggingFace.

Paper Structure

This paper contains 20 sections, 6 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: A comparison of the AceCoder method (Li et al., 2023) (top) and $\text{M}^{2}\text{WF}$ (bottom). AceCoder must retrieve relevant examples from a training set to guide the LLM's code generation, whereas $\text{M}^{2}\text{WF}$ draws on the LLM's own knowledge to provide guidance.
  • Figure 2: Metamemory workflow.
  • Figure 3: Analysis of ChatGPT's performance with analogical prompting on the HumanEval benchmark (Chen et al., 2021). Incorrect analogy examples lead to errors in 15.24% of the code generated by ChatGPT. Moreover, analogical prompting is not specifically tailored to code generation tasks.
  • Figure 4: The overall framework of the metamemory workflow ($\text{M}^{2}\text{WF}$). The framework is divided into four stages: 1) recalling related examples of programming problems; 2) evaluating the recalled examples; 3) providing an implementation plan for the original programming problem; and 4) guiding the LLM to solve the original programming problem based on the implementation plan.
  • Figure 5: The performance of the models (i.e., Mistral-7B-Instruct-v0.2, DeepSeek-Coder-V2, and GPT-4) with the $\text{M}^{2}\text{WF}$ method when recalling $K$ examples and selecting the $M$ recalled examples with the highest confidence. During the experiment, we use a temperature of $0.8$, top-$p = 0.95$, and $n = 1$.
  • ...and 1 more figure
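
The four stages named in the Figure 4 caption, together with the "recall $K$, keep the top $M$ by confidence" selection from the Figure 5 caption, can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the prompt wording, the 0 to 100 confidence scale, the `LLM` callable interface, and every function name are assumptions made for the example. This section also does not state whether the stages are issued as one prompt or several; the sketch uses separate calls only for readability. Sampling settings such as temperature and top-$p$ would live inside whatever callable is passed in.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class ScoredExample:
    text: str          # a recalled problem and its solution, as raw text
    confidence: float  # self-assessed score, clamped to [0, 100] (assumed scale)


def parse_confidence(reply: str) -> float:
    """Read a numeric score from a reply; unparseable or NaN replies score 0."""
    try:
        value = float(reply.strip())
    except ValueError:
        return 0.0
    if math.isnan(value):
        return 0.0
    return max(0.0, min(100.0, value))


def recall(llm: LLM, problem: str, k: int) -> List[str]:
    """Stage 1: ask the model for k related problems with solutions."""
    return [
        llm(f"RECALL {i + 1}/{k}: describe a problem related to the task "
            f"and solve it in Python3.\nTask:\n{problem}")
        for i in range(k)
    ]


def evaluate(llm: LLM, problem: str, examples: List[str]) -> List[ScoredExample]:
    """Stage 2: ask the model to rate each recalled example."""
    scored = []
    for example in examples:
        reply = llm("EVALUATE: reply with a confidence from 0 to 100 that the "
                    f"example is correct and relevant.\nTask:\n{problem}\n"
                    f"Example:\n{example}")
        scored.append(ScoredExample(example, parse_confidence(reply)))
    return scored


def select_top(scored: List[ScoredExample], m: int) -> List[ScoredExample]:
    """Keep the m examples with the highest confidence."""
    return sorted(scored, key=lambda s: s.confidence, reverse=True)[:m]


def plan(llm: LLM, problem: str, kept: List[ScoredExample]) -> str:
    """Stage 3: ask for an implementation plan informed by the kept examples."""
    context = "\n\n".join(s.text for s in kept)
    return llm("PLAN: outline an implementation plan for the task, informed by "
               f"the examples.\nExamples:\n{context}\nTask:\n{problem}")


def guide(llm: LLM, problem: str, plan_text: str) -> str:
    """Stage 4: ask for the final Python3 solution, following the plan."""
    return llm("GUIDE: write the final Python3 solution by following the plan."
               f"\nPlan:\n{plan_text}\nTask:\n{problem}")


def metamemory_sketch(llm: LLM, problem: str, k: int = 3, m: int = 1) -> str:
    """Run recall -> evaluate -> plan -> guide and return the final completion."""
    if not 1 <= m <= k:
        raise ValueError("expected 1 <= m <= k")
    kept = select_top(evaluate(llm, problem, recall(llm, problem, k)), m)
    return guide(llm, problem, plan(llm, problem, kept))


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only to exercise the flow."""
    if prompt.startswith("RECALL"):
        return "example-" + prompt.split()[1].split("/")[0]
    if prompt.startswith("EVALUATE"):
        scores = {"example-1": "20", "example-2": "90", "example-3": "55"}
        return scores.get(prompt.rsplit("\n", 1)[-1], "0")
    if prompt.startswith("PLAN"):
        return "plan from " + prompt.split("Examples:\n")[1].split("\nTask:")[0]
    return "# " + prompt.split("Plan:\n")[1].split("\nTask:")[0] + "\ndef solution(): pass"


if __name__ == "__main__":
    print(metamemory_sketch(stub_llm, "add two numbers", k=3, m=1))
```

To try this against a real model, replace `stub_llm` with a function that wraps the chosen client and returns the completion text. Treating unparseable confidence replies as 0 is one possible design choice: it keeps malformed evaluations from being selected over well-formed ones.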