HCRMP: A LLM-Hinted Contextual Reinforcement Learning Framework for Autonomous Driving

Zhiwen Chen, Bo Leng, Zhuoren Li, Hanming Deng, Guizhe Jin, Ran Yu, Huanxi Wen

TL;DR

This work addresses the vulnerability of autonomous driving systems that rely on large language models (LLMs) by proposing an LLM-Hinted Reinforcement Learning paradigm. The core idea is to decouple LLM outputs from direct policy control: the LLM instead provides semantic hints that augment the state and shape policy optimization through a multi-critic framework, which limits the effect of hallucinations. The HCRMP architecture comprises three modules: Augmented Semantic Representation (ASR), which expands the state with LLM-derived semantics; Contextual Stability Anchor (CSA), which produces reliable multi-critic weights via retrieval-augmented knowledge; and Semantic Cache Module (SCM), which covers LLM latency with a historical semantic memory. Experiments in CARLA Town 2 show that HCRMP achieves task success rates of up to 80.3% across varied traffic densities and reduces the collision rate by 11.4% in safety-critical scenarios, outperforming baseline LLM-Dominated RL methods. These findings indicate that a weakly coupled LLM-RL system can exploit the LLM's strengths in reasoning and context while preserving the RL agent’s autonomous, stable learning for robust autonomous driving.

Abstract

Integrating Large Language Models (LLMs) with Reinforcement Learning (RL) can enhance autonomous driving (AD) performance in complex scenarios. However, current LLM-Dominated RL methods over-rely on LLM outputs, which are prone to hallucinations. Evaluations show that a state-of-the-art LLM achieves a non-hallucination rate of only approximately 57.95% on essential driving-related tasks. In these methods, LLM hallucinations can therefore directly jeopardize the performance of driving policies. This paper argues that maintaining relative independence between the LLM and the RL agent is vital for solving the hallucination problem, and accordingly proposes a novel LLM-Hinted RL paradigm. The LLM generates semantic hints for state augmentation and policy optimization to assist the RL agent in motion planning, while the RL agent counteracts potentially erroneous semantic indications through policy learning to achieve excellent driving performance. Based on this paradigm, we propose the HCRMP (LLM-Hinted Contextual Reinforcement Learning Motion Planner) architecture, which consists of three modules. The Augmented Semantic Representation Module extends the state space. The Contextual Stability Anchor Module enhances the reliability of multi-critic weight hints by utilizing information from the knowledge base. The Semantic Cache Module seamlessly integrates low-frequency LLM guidance with high-frequency RL control. Extensive experiments in CARLA validate HCRMP's strong overall driving performance. HCRMP achieves a task success rate of up to 80.3% under diverse driving conditions with different traffic densities. Under safety-critical driving conditions, HCRMP reduces the collision rate by 11.4%, effectively improving driving performance in complex scenarios.
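To make the "hint, not command" idea concrete, the sketch below shows one way LLM weight hints could enter a multi-critic value estimate. It is an illustration under stated assumptions, not the authors' implementation: the function names, the three-critic example, and the uniform fallback for malformed hints are all hypothetical and do not come from the paper.

```python
import math


def normalize_weights(raw, n_critics):
    """Turn raw LLM weight hints into a valid convex combination.

    Assumption (not from the paper): a malformed hint (wrong length,
    non-finite, negative, or all-zero) falls back to uniform weights.
    """
    uniform = [1.0 / n_critics] * n_critics
    if raw is None or len(raw) != n_critics:
        return uniform
    if any((not math.isfinite(w)) or w < 0 for w in raw):
        return uniform
    total = sum(raw)
    if total <= 0:
        return uniform
    return [w / total for w in raw]


def combined_q(critic_values, raw_weights):
    """Weighted sum of per-critic Q-values; the LLM only supplies weights."""
    weights = normalize_weights(raw_weights, len(critic_values))
    return sum(w * q for w, q in zip(weights, critic_values))


# Hypothetical critics, e.g. safety, efficiency, comfort.
q_values = [1.0, 2.0, 3.0]
print(combined_q(q_values, [2, 1, 1]))   # hint emphasizes the first critic
print(combined_q(q_values, [-1, 5, 0]))  # invalid hint, uniform fallback
```

In this sketch the LLM never selects an action. It only reweights critics whose values the RL agent learns itself, so a bad hint skews the emphasis rather than directly dictating a driving action.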

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: LLM performance evaluation and the impact of hallucinations on LLM-RL methods. Figure (a) shows the SOTA LLM's non-hallucination rates across five driving tasks. Figure (b) illustrates LLM-Dominated RL methods, in which LLM hallucinations directly distort the RL agent's Q-value estimation and degrade policy efficiency, leading to infeasible driving actions. Figure (c) presents the LLM-Hinted RL method, in which the LLM provides semantic hints to the RL agent instead of directly dictating decisions. The RL agent buffers the negative impact of hallucinations through its own policy learning, thereby preventing infeasible actions.
  • Figure 2: The framework of the proposed HCRMP. In the Augmented Semantic Representation module, the LLM extracts scenario-level and object-level information to extend the state space. In the Contextual Stability Anchor module, the LLM generates reliable weights across the multiple critics, using the knowledge base to mitigate output fluctuations. When the LLM fails to provide timely guidance, the Semantic Cache module fills in the missing weights by retrieving the most similar historical driving conditions. The LLM's hints are ultimately input to the RL agent's actor and multi-critic networks for optimal policy learning.
  • Figure 3: Rewards of the HCRMP variants: dynamic trends and statistical distributions. Figure (a) shows that the reward curve for HCRMP without CSA fluctuates significantly, indicating unstable performance, while the curve for HCRMP with CSA fluctuates less and is more stable. Figure (b) shows that the reward distribution for the variant without CSA is wider and has more mass in low-reward regions, whereas the CSA variant's rewards are concentrated in higher value ranges with a more prominent peak.
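The Semantic Cache fallback described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class and function names, the fixed-size first-in-first-out cache, and the use of cosine similarity over a plain feature vector are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class SemanticCache:
    """Hypothetical store of (driving-condition features, critic weights)."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.entries = []  # oldest entry first

    def store(self, features, weights):
        self.entries.append((list(features), list(weights)))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # evict the oldest entry

    def lookup(self, features):
        """Return the weights of the most similar stored condition, or None."""
        if not self.entries:
            return None
        best = max(self.entries,
                   key=lambda e: cosine_similarity(features, e[0]))
        return best[1]


def weights_for_step(llm_weights, features, cache):
    """Use fresh LLM weights when available; otherwise fall back to the cache."""
    if llm_weights is not None:
        cache.store(features, llm_weights)
        return list(llm_weights)
    return cache.lookup(features)


cache = SemanticCache()
weights_for_step([0.7, 0.2, 0.1], [1.0, 0.0], cache)  # LLM answered in time
weights_for_step([0.2, 0.5, 0.3], [0.0, 1.0], cache)  # LLM answered in time
print(weights_for_step(None, [0.9, 0.1], cache))      # LLM late: cached weights
```

The point of the design is that the RL control loop never blocks on the LLM: each control step either consumes a fresh hint or reuses the hint from the closest previously seen condition.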