LangProp: A code optimization framework using Large Language Models applied to driving

Shu Ishida; Gianluca Corrado; George Fedoseev; Hudson Yeo; Lloyd Russell; Jamie Shotton; João F. Henriques; Anthony Hu

TL;DR

LangProp reframes code generation as a metric-driven optimization problem where an LLM generates executable policy scripts and a trainer iteratively evaluates, prioritizes, and updates them using data and environment feedback. By employing a policy tracker, priority-based reranking, and a flexible prompt-template engine, LangProp translates traditional ML training paradigms (imitation learning, DAgger, RL) into the realm of code optimization. The framework is demonstrated on Sudoku, CartPole, and CARLA autonomous driving, showing improved performance and interpretable policies compared to zero-shot generation and several baselines. This work suggests a path toward transparent, data-driven code repair and policy learning where LLMs act as optimizers rather than just generators, with practical implications for robotics and automated driving.

Abstract

We propose LangProp, a framework for iteratively optimizing code generated by large language models (LLMs), in both supervised and reinforcement learning settings. While LLMs can generate sensible coding solutions zero-shot, these solutions are often sub-optimal; in particular, the initial code is likely to fail on certain edge cases. LangProp automatically evaluates the code's performance on a dataset of input-output pairs, catches any exceptions, and feeds the results back to the LLM in the training loop, so that the LLM can iteratively improve the code it generates. By adopting a metric- and data-driven training paradigm for this code optimization procedure, one can readily adapt findings from traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning. We show LangProp's applicability to general domains such as Sudoku and CartPole, and demonstrate the first proof of concept of automated code optimization for autonomous driving in CARLA. We show that LangProp can generate interpretable and transparent policies that can be verified and improved in a metric- and data-driven way. Our code is available at https://github.com/shuishida/LangProp.
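The evaluation step described in the abstract (run the generated code on input-output pairs, catch exceptions, and collect feedback for the LLM) can be illustrated with a minimal sketch. This is not the LangProp API: the function `evaluate_policy`, its parameters, the 0/1 match score, and the feedback string format are all hypothetical, and the exception penalty of -10 is an assumption chosen to match the score range quoted in the Figure 4 caption.

```python
from typing import Any, List, Tuple


def evaluate_policy(
    source: str,
    func_name: str,
    dataset: List[Tuple[tuple, Any]],
    exception_penalty: float = -10.0,  # assumed value, not taken from the paper
) -> Tuple[float, List[str]]:
    """Score a generated policy script on (args, expected) pairs.

    Returns the mean score and a list of feedback messages that could be
    placed in a prompt asking the LLM to repair the script.
    """
    namespace: dict = {}
    try:
        # Load the generated script and look up the policy function.
        exec(source, namespace)
        policy = namespace[func_name]
    except Exception as exc:
        return exception_penalty, [f"script failed to load: {exc!r}"]

    if not dataset:
        return 0.0, []

    total = 0.0
    feedback: List[str] = []
    for args, expected in dataset:
        try:
            output = policy(*args)
        except Exception as exc:
            # Exceptions are caught, penalized, and reported as feedback.
            total += exception_penalty
            feedback.append(f"input {args!r} raised {exc!r}")
            continue
        if output == expected:
            total += 1.0
        else:
            feedback.append(
                f"input {args!r}: expected {expected!r}, got {output!r}"
            )
    return total / len(dataset), feedback


# Hypothetical generated script with an unhandled edge case (division by zero).
generated = "def safe_div(a, b):\n    return a / b\n"
pairs = [((4, 2), 2.0), ((1, 0), 0.0), ((9, 3), 2.0)]
score, messages = evaluate_policy(generated, "safe_div", pairs)
```

In this example the script passes one case, raises on one, and returns a wrong value on one, so the feedback list contains two messages that describe the failures. Executing untrusted generated code with `exec` is only acceptable in a sandboxed setting; the sketch omits that isolation for brevity.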

Paper Structure (54 sections, 2 equations, 4 figures, 2 tables)

Introduction
Related work
LLMs for code generation
LLMs for automating compositional tasks
The LangProp Framework
Model definition
Policy setup
Training objective
Forward-pass and feedback
Priority
Policy reranking and update
Prompt template engine
Training paradigm
Experiments
Generalized Sudoku
...and 39 more sections

Figures (4)

Figure 1: An overview of the LangProp framework, which consists of a LangProp model, an LLM optimizer, and a LangProp trainer. During training, the LLM generates and updates the policy scripts, which are evaluated against a training objective. The performance of each policy is monitored and aggregated over time by a policy tracker as a priority, and the priorities are then used to rerank the policies. Policies with higher priorities are selected for updates, and the best policy is used for inference.
Figure 2: The total number of environment steps required to learn CartPole-v1 ($10$ seeds per method) in comparison to an RL method (PPO). Most seeds converged to an optimal solution within $10$ LangProp updates.
Figure 3: An overview of the LangProp agent training pipeline. The LangProp model is updated on a dataset that includes both offline expert data as well as online LangProp data annotated with expert actions, similar to DAgger. The agent is given negative rewards upon infraction.
Figure 4: Training curves for the different training methods of the LangProp agent. The training scores are evaluated on $1000$ samples from the offline training dataset and/or online replay buffer, and the validation scores are evaluated on $1000$ samples from the offline validation dataset. Updates are performed every $1000$ frames of agent driving, and upon infractions in the RL setting. The score is in the range of $[-10, 1]$ due to exception penalties. We limit the axis to $[-1, 1]$ in the plots.
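
The policy tracking and reranking described in the Figure 1 caption can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the class names `PolicyRecord` and `PolicyTracker`, the use of a plain running mean as the priority, and the `max_policies` cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PolicyRecord:
    """One candidate policy script and its accumulated scores."""

    source: str
    total_score: float = 0.0
    num_evals: int = 0

    @property
    def priority(self) -> float:
        # Assumed aggregation: running mean of all recorded scores.
        # Policies that were never evaluated rank last.
        if self.num_evals == 0:
            return float("-inf")
        return self.total_score / self.num_evals


class PolicyTracker:
    """Keeps candidate policies and reranks them by priority."""

    def __init__(self, max_policies: int) -> None:
        self.max_policies = max_policies
        self.records: Dict[str, PolicyRecord] = {}

    def add(self, name: str, source: str) -> None:
        self.records[name] = PolicyRecord(source)

    def record(self, name: str, score: float) -> None:
        rec = self.records[name]
        rec.total_score += score
        rec.num_evals += 1

    def rerank(self) -> List[str]:
        """Sort by priority (highest first) and drop the lowest-ranked."""
        ranked = sorted(
            self.records,
            key=lambda n: self.records[n].priority,
            reverse=True,
        )
        for name in ranked[self.max_policies:]:
            del self.records[name]
        return ranked[: self.max_policies]

    def best(self) -> str:
        """Name of the policy that would be used for inference."""
        return self.rerank()[0]


tracker = PolicyTracker(max_policies=2)
for name in ("a", "b", "c"):
    tracker.add(name, source=f"# script {name}")
tracker.record("a", 1.0)
tracker.record("a", 0.0)
tracker.record("b", 1.0)
tracker.record("b", 1.0)
tracker.record("c", -10.0)  # e.g. an exception penalty
kept = tracker.rerank()
```

Here policy `b` ranks first, `a` second, and `c` is discarded after incurring the penalty. In a full training loop, the surviving high-priority policies would be the ones passed back to the LLM for further updates.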
