Are LLMs The Way Forward? A Case Study on LLM-Guided Reinforcement Learning for Decentralized Autonomous Driving

Timur Anvar, Jeffrey Chen, Yuyan Wang, Rohan Chandra

TL;DR

The paper investigates whether small, locally deployed LLMs can meaningfully augment RL for autonomous highway driving through reward shaping rather than direct control. It compares RL-only, LLM-only, and hybrid RL+LLM approaches using two open-weight models (Qwen3-14B and Gemma3-12B) across highway-fast, highway, and merge scenarios with dense, averaged, and centered shaping schemes. The findings show RL-only achieves moderate safety with SR between 63% and 89%, LLM-only reaches up to 94% SR but at severely degraded speeds, and hybrids yield intermediate performance with a persistent conservative bias and high model-dependence. The results underscore both the promise and limitations of on-device LLMs for safety-critical control, suggesting that future work with larger models, richer observations, and alternative shaping strategies could reconcile safety with higher efficiency in decentralized autonomous driving.

Abstract

Autonomous vehicle navigation in complex environments such as dense and fast-moving highways and merging scenarios remains an active area of research. A key limitation of RL is its reliance on well-specified reward functions, which often fail to capture the full semantic and social complexity of diverse, out-of-distribution situations. As a result, a rapidly growing line of research explores using Large Language Models (LLMs) to replace or supplement RL for direct planning and control, on account of their ability to reason about rich semantic context. However, LLMs present significant drawbacks: they can be unstable in zero-shot safety-critical settings, produce inconsistent outputs, and often depend on expensive API calls with network latency. This motivates our investigation into whether small, locally deployed LLMs (< 14B parameters) can meaningfully support autonomous highway driving through reward shaping rather than direct control. We present a case study comparing RL-only, LLM-only, and hybrid approaches, where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time. Our findings reveal that RL-only agents achieve moderate success rates (73-89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. Critically, despite explicit efficiency instructions, LLM-influenced approaches exhibit systematic conservative bias with substantial model-dependent variability, highlighting important limitations of current small LLMs for safety-critical control tasks.
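The hybrid setup described in the abstract, where an LLM scores state-action transitions during training and a standard RL policy runs alone at test time, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the authors' implementation: the additive form, the `weight` parameter, the clipping range of [-1, 1], and the `stub_scorer` function, which stands in for a call to a locally deployed LLM and simply rewards a non-shrinking headway.

```python
from typing import Callable, Sequence

# Hypothetical scorer signature: (state, action, next_state) -> score.
Scorer = Callable[[Sequence[float], int, Sequence[float]], float]


def shaped_reward(env_reward: float, llm_score: float, weight: float = 0.5) -> float:
    """Combine the environment reward with a clipped LLM score.

    The additive form and the [-1, 1] clipping are assumptions for this sketch.
    """
    clipped = max(-1.0, min(1.0, llm_score))
    return env_reward + weight * clipped


def stub_scorer(state: Sequence[float], action: int, next_state: Sequence[float]) -> float:
    """Stand-in for a local LLM call (hypothetical).

    Treats index 0 of the state as headway to the lead vehicle and returns
    +1.0 if the headway did not shrink, otherwise -1.0.
    """
    return 1.0 if next_state[0] >= state[0] else -1.0


def training_reward(
    state: Sequence[float],
    action: int,
    next_state: Sequence[float],
    env_reward: float,
    scorer: Scorer = stub_scorer,
    weight: float = 0.5,
) -> float:
    """Reward seen by the RL learner during training only."""
    return shaped_reward(env_reward, scorer(state, action, next_state), weight)
```

In this sketch the scorer is consulted only inside `training_reward`; at test time the learned policy would act on the environment alone, which matches the abstract's description of LLM involvement being limited to training.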

Paper Structure

This paper contains 27 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: DQN hyperparameters used in our RL-only experiments.
  • Figure 2: RL-only safety–efficiency spectrum. Longer training increases safety while reducing speed.
  • Figure 3: Text prompt used in our LLM-only experiments.
  • Figure 4: Comparison of reward shaping schemes across success rate (SR), lane changes (LC), and speed score. The figure highlights the spread and variability of outcomes under each scheme, complementing the aggregate results in Table \ref{tab:hybrid_all_rewards}.
  • Figure 5: Safety–efficiency spectrum across approaches. Each point is one configuration (environment $\times$ training steps $\times$ model/scheme), plotted by success rate vs. speed score.
  • ...and 1 more figure