LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models

Zengqi Peng; Yubin Wang; Xu Han; Lei Zheng; Jun Ma

TL;DR

LearningFlow presents a closed-loop workflow that uses multiple LLM agents to automatically generate and refine training curricula and reward functions for reinforcement learning in urban driving. It integrates a memory-augmented reasoning framework with curriculum and reward analysis/generation modules and a reflection loop, enabling online adaptation during policy training. In CARLA Town06 experiments with PPO, LearningFlow achieves superior success rates and robust generalization across driving tasks and RL algorithms, while ablations show the critical role of the analysis agents. The approach reduces manual reward design, improves sample efficiency, and is compatible with multiple RL algorithms; future work points toward enhanced multimodal decision-making via diffusion models.

Abstract

Recent advancements in reinforcement learning (RL) have demonstrated significant potential in autonomous driving. Despite this promise, challenges such as the manual design of reward functions and low sample efficiency in complex environments continue to impede the development of safe and effective driving policies. To tackle these issues, we introduce LearningFlow, an innovative automated policy learning workflow tailored to urban driving. This framework leverages the collaboration of multiple large language model (LLM) agents throughout the RL training process. LearningFlow includes a curriculum sequence generation process and a reward generation process, which work in tandem to guide the RL policy by generating tailored training curricula and reward functions. In particular, each process is supported by an analysis agent that evaluates training progress and provides critical insights to the generation agent. Through the collaborative efforts of these LLM agents, LearningFlow automates policy learning across a series of complex driving tasks, significantly reducing the reliance on manual reward function design while enhancing sample efficiency. Comprehensive experiments are conducted in the high-fidelity CARLA simulator, along with comparisons with existing methods, to demonstrate the efficacy of the proposed approach. The results show that LearningFlow excels in generating rewards and curricula. It also achieves superior performance and robust generalization across various driving tasks, as well as commendable adaptation to different RL algorithms.

Paper Structure (32 sections, 15 equations, 6 figures, 2 tables)

Introduction
Related Work
Reward Design for Deep Reinforcement Learning
Training Efficiency for Deep Reinforcement Learning
Large Language Model Applications
LLMs for Embodied Inference
LLMs for Policy Learning
Problem Formulation
Problem Statement
Learning Environment
Curriculum Sequence Generation Problem
Methodology
Overview of the LearningFlow
Memory Module for Closed-Loop Policy Training Workflow
Iterative Curriculum Sequence Generation
...and 17 more sections

Figures (6)

Figure 1: The proposed LLM-in-the-training-loop CRL training paradigm. The Multi-LLM-agent system generates reward functions and training curriculum sequences for the downstream RL policy through the collaboration of multiple LLM-based agents. The historical training data generated through interactions between the RL policy and the environment is stored in a memory module and then fed back to the Multi-LLM-agent system as reference information for subsequent generation steps.
Figure 2: Overview of the LearningFlow framework for automated driving policy learning with interactive SVs. In the reasoning module, the analysis agents process prompts containing context descriptors, historical training information, and task objectives to perform inference, providing task analysis as a reference for the generation agent. The generation agent, based on the analysis results and other relevant prompts, selects training curricula and generates reward functions to initialize the downstream training environment and RL agent. After initialization, the RL agent interacts with the environment and records training data. Upon completing a certain number of episodes, the training data, along with the decision contents from the LLM agents, are stored in the memory module. These records are then summarized by the reflection module and fed back to the agents in the reasoning module to support the next round of inference.
Figure 3: Representative segments of the curriculum generation process demonstration during the initial training phase.
Figure 4: Representative segments of the reward generation process demonstration during the initial training phase.
Figure 5: A failure case of the reward generation process by LearningFlow without analysis agents, where the design flaws are highlighted in bold red.
...and 1 more figure
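
The closed loop described in the Figure 2 caption (analysis agents, generation agent, RL training, memory, reflection) can be illustrated with a minimal sketch. This is not the authors' implementation: the LLM agents and the RL training step are replaced by deterministic stubs, and every name here (`CURRICULA`, `analysis_agent`, `generation_agent`, `train_policy`, the 0.8 advancement threshold, the reward weight keys) is a hypothetical placeholder chosen only to show how information flows between the modules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical curriculum stages, ordered from easy to hard.
CURRICULA = ["empty_road", "sparse_traffic", "dense_traffic"]


@dataclass
class Record:
    round_idx: int
    curriculum: str
    reward_weights: Dict[str, float]
    success_rate: float


@dataclass
class Memory:
    """Stores per-round training data together with the agents' decisions."""
    records: List[Record] = field(default_factory=list)

    def store(self, record: Record) -> None:
        self.records.append(record)

    def reflect(self, last_n: int = 3) -> str:
        """Reflection step: summarize recent history as text for the next round."""
        recent = self.records[-last_n:]
        if not recent:
            return "no history"
        return "; ".join(
            f"round {r.round_idx}: {r.curriculum} -> success {r.success_rate:.2f}"
            for r in recent
        )


def analysis_agent(memory: Memory) -> dict:
    """Stub standing in for an LLM analysis agent (assumed threshold: 0.8)."""
    summary = memory.reflect()
    if not memory.records:
        return {"summary": summary, "advance": False}
    return {"summary": summary, "advance": memory.records[-1].success_rate >= 0.8}


def generation_agent(analysis: dict, memory: Memory) -> Tuple[str, Dict[str, float]]:
    """Stub standing in for an LLM generation agent: picks curriculum and reward."""
    if not memory.records:
        return CURRICULA[0], {"progress": 1.0, "collision": -1.0}
    last = memory.records[-1]
    stage = CURRICULA.index(last.curriculum)
    weights = dict(last.reward_weights)
    if analysis["advance"]:
        stage = min(stage + 1, len(CURRICULA) - 1)  # move to a harder stage
    else:
        weights["collision"] -= 1.0  # strengthen the collision penalty
    return CURRICULA[stage], weights


def train_policy(curriculum: str, reward_weights: Dict[str, float], rounds_on_stage: int) -> float:
    """Stub for RL training; returns a made-up success rate, not real results."""
    return min(1.0, 0.4 + 0.25 * rounds_on_stage)


def run_workflow(num_rounds: int) -> Memory:
    memory = Memory()
    for i in range(num_rounds):
        analysis = analysis_agent(memory)
        curriculum, weights = generation_agent(analysis, memory)
        rounds_on_stage = 1 + sum(r.curriculum == curriculum for r in memory.records)
        success = train_policy(curriculum, weights, rounds_on_stage)
        memory.store(Record(i, curriculum, weights, success))
    return memory


if __name__ == "__main__":
    for rec in run_workflow(6).records:
        print(rec)
```

In a real system the two agent stubs would be prompts to LLMs and `train_policy` would run an RL algorithm such as PPO in the simulator; the sketch only fixes the order of the calls and what is stored and fed back each round.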
