NNGPT: Rethinking AutoML with Large Language Models

Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte

TL;DR

NNGPT reframes AutoML as a closed-loop system in which a single fine-tuned LLM generates executable neural network training specifications, validates and runs them, and then learns from the results to improve both the generator and the predictors. It unifies five pipelines (zero-shot architecture generation, hyperparameter recommendation, code-aware accuracy/early-stop prediction, NN-RAG retrieval-augmented patching, and reinforcement learning updates) into an end-to-end workflow anchored by the LEMUR corpus. Empirical results show substantial growth of the LEMUR dataset, a 73% executability rate for retrieved blocks, one-shot hyperparameter generation that is competitive with traditional search, and code-aware accuracy prediction robust enough to support early stopping and scheduling. The framework achieves practical AutoML gains on mid-scale vision benchmarks and demonstrates strong reproducibility through a shared Dockerized stack and open-source artifacts, paving the way for community-driven, autonomous AutoML experimentation.
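The TL;DR credits code-aware accuracy prediction with enabling early stopping and scheduling. The sketch below shows one plausible way such predictions could drive both decisions. It assumes a predictor that returns an estimated final accuracy for each candidate; the names `Candidate`, `schedule`, `should_stop_early`, and the `margin` threshold are hypothetical and not part of NNGPT.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    # Estimated final accuracy from a code-aware predictor (assumed to exist).
    predicted_final_acc: float


def schedule(candidates: list[Candidate], budget: int) -> list[str]:
    """Return the names of the `budget` candidates with the highest predicted accuracy."""
    ranked = sorted(candidates, key=lambda c: c.predicted_final_acc, reverse=True)
    return [c.name for c in ranked[:budget]]


def should_stop_early(predicted_final_acc: float, best_observed_acc: float,
                      margin: float = 0.02) -> bool:
    """Hypothetical rule: stop a run whose predicted final accuracy falls
    more than `margin` below the best accuracy observed so far."""
    return predicted_final_acc < best_observed_acc - margin


if __name__ == "__main__":
    pool = [Candidate("a", 0.71), Candidate("b", 0.84), Candidate("c", 0.78)]
    print(schedule(pool, budget=2))       # ['b', 'c']
    print(should_stop_early(0.70, 0.80))  # True
```

Ranking by predicted accuracy spends a fixed training budget on the most promising candidates first, while the margin keeps a noisy predictor from terminating runs that are only slightly behind the current best.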

Abstract

Building self-improving AI systems remains a fundamental challenge in AI. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs through a closed-loop system of generation, assessment, and self-improvement. It integrates five synergistic LLM-based pipelines into one unified workflow: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset, an audited corpus with reproducible metrics, NNGPT emits and validates the network architecture, preprocessing code, and hyperparameters from a single prompt, executes them end-to-end, and learns from the results. The PyTorch adapter makes NNGPT framework-agnostic. In experiments, NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. On LEMUR, the HPO pipeline achieves an RMSE of 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches an RMSE of 0.14 with Pearson r = 0.78. The system has already generated over 5K validated models, demonstrating NNGPT's viability as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released publicly to enable reproducibility and facilitate community use.
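The abstract reports that hash-based deduplication saves hundreds of runs but does not say what is hashed. A minimal sketch, assuming generated candidates are Python source: normalize each program through its abstract syntax tree so that comments and formatting are ignored, then hash the canonical form with SHA-256. `code_fingerprint` and `DedupCache` are hypothetical names, not NNGPT's API.

```python
import ast
import hashlib


def code_fingerprint(source: str) -> str:
    """SHA-256 of the program's AST dump.

    ast.dump() omits line and column attributes by default, so comments,
    blank lines and spacing do not change the fingerprint.
    """
    canonical = ast.dump(ast.parse(source))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupCache:
    """Remembers fingerprints of programs that have already been scheduled."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, source: str) -> bool:
        """True the first time a program (up to formatting) is seen, else False."""
        fp = code_fingerprint(source)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


if __name__ == "__main__":
    cache = DedupCache()
    print(cache.is_new("def f(x):\n    return x + 1\n"))           # True
    print(cache.is_new("def f(x):  # comment\n    return x+1\n"))  # False: same AST
```

Hashing the AST rather than raw text catches copies that differ only cosmetically; programs that are equivalent but structurally different (for example, with renamed variables) still receive different fingerprints.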
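The HPO and predictor results are reported as RMSE and Pearson correlation. The abstract does not state which targets or units these errors are measured on, so the following sketch only spells out the standard definitions of the two metrics as applied to paired predicted and observed values; the function names are illustrative.

```python
import math


def rmse(pred: list[float], target: list[float]) -> float:
    """Root-mean-square error: sqrt(mean((pred_i - target_i)^2))."""
    if len(pred) != len(target) or not pred:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation: covariance over the product of standard deviations.

    The 1/n normalizations cancel in the ratio, so raw sums are used.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

RMSE is expressed in the units of the target, so it is meaningful for comparing methods evaluated on the same quantity; Pearson r is scale-free and measures linear agreement between predictions and observations.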

Paper Structure

This paper contains 43 sections, 4 equations, 2 figures, and 10 tables.

Figures (2)

  • Figure 1: Overview of the NNGPT pipeline: starting with a query to the LEMUR API to retrieve a neural network entry and its metadata, the system constructs a prompt, uses an LLM to generate the neural network model code, and trains the model. Artifacts are logged in a structured database, while LoRA fine-tuning continuously updates the LLM based on training results, creating a self-improving AutoML loop.
  • Figure 2: Detailed view of pipeline stages 3–7 in NNGPT. Left to right: Configuration Setup & Validation (default or HF-loaded LLM, PEFT/quantization); Prompt Assembly & Injection (templated prompts with task data); LLM-Guided Architecture Generation (one-shot code and hyperparameters); Model Validation & Evaluation (format checks, training, logging); LoRA Fine-tuning (update adapters from train/*.json). All pipelines reuse this stack; variation mainly occurs in the Prompt stage.
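Figure 2 lists format checks in the validation stage and LoRA updates fed from `train/*.json`. A minimal sketch of such a gate, under stated assumptions: the generated code must parse as Python and define a model class (assumed here to be named `Net`), the hyperparameter dictionary must contain a fixed set of keys (the `lr`, `batch`, `epoch` set is hypothetical), and each evaluated generation becomes one JSON record for fine-tuning. The record schema and function names are illustrative, not NNGPT's actual format.

```python
import ast
import json

# Hypothetical requirements; NNGPT's real checks are not specified in this overview.
MODEL_CLASS = "Net"
REQUIRED_HPARAMS = {"lr", "batch", "epoch"}


def validate_generation(code: str, hparams: dict) -> list[str]:
    """Return format errors for one generated candidate; [] means it may be trained."""
    try:
        tree = ast.parse(code)  # parses only; nothing is imported or executed
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]
    errors = []
    classes = {node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)}
    if MODEL_CLASS not in classes:
        errors.append(f"missing class {MODEL_CLASS!r}")
    missing = REQUIRED_HPARAMS - set(hparams)
    if missing:
        errors.append(f"missing hyperparameters: {sorted(missing)}")
    return errors


def to_finetune_record(prompt: str, code: str, hparams: dict, accuracy: float) -> str:
    """Serialize one evaluated generation as a JSON line for a later LoRA update."""
    return json.dumps(
        {"prompt": prompt, "code": code, "hparams": hparams, "accuracy": accuracy}
    )
```

Static checks like these run before any GPU time is spent, so malformed generations are rejected without training; only candidates that pass would reach the training and logging stages.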