Reality Bites: Assessing the Realism of Driving Scenarios with Large Language Models

Jiahui Wu, Chengjie Lu, Aitor Arrieta, Tao Yue, Shaukat Ali

TL;DR

An empirical evaluation of whether LLMs can assess if driving scenarios generated by autonomous driving testing techniques are realistic, i.e., aligned with real-world driving conditions, shows that roads and weather conditions influence the robustness of the LLMs.

Abstract

Large Language Models (LLMs) are demonstrating outstanding potential for tasks such as text generation, summarization, and classification. Given that such models are trained on a vast amount of online knowledge, we hypothesize that LLMs can assess whether driving scenarios generated by autonomous driving testing techniques are realistic, i.e., aligned with real-world driving conditions. To test this hypothesis, we conducted an empirical evaluation of whether LLMs are effective and robust in performing this task. This reality check is an important step towards devising LLM-based autonomous driving testing techniques. For our empirical evaluation, we selected 64 realistic scenarios from DeepScenario, an open driving scenario dataset. Next, by introducing minor changes to them, we created 512 additional realistic scenarios, forming an overall dataset of 576 scenarios. With this dataset, we evaluated the robustness of three LLMs (GPT, Llama, and Mistral) in assessing the realism of driving scenarios. Our results show that: (1) overall, GPT achieved the highest robustness compared to Llama and Mistral, consistently across almost all scenarios, roads, and weather conditions; (2) Mistral consistently performed the worst; (3) Llama achieved good results under certain conditions; and (4) roads and weather conditions do influence the robustness of the LLMs.

Paper Structure (13 sections, 4 equations, 5 figures, 6 tables)


Figures (5)

  • Figure 1: Process from generating a prompt for a DeepScenario scenario to evaluating the realism of the scenario with an LLM
  • Figure 2: Distributions of each LLM's success rate in evaluating all driving scenarios. The mean represents the central tendency.
  • Figure 3: Distributions of robustness scores achieved by the LLMs (overall) for answering RQ1.a and RQ1.c. The mean represents the central tendency.
  • Figure 4: Distributions of robustness scores achieved by the LLMs (by road) for answering RQ2.a and RQ2.c. The mean represents the central tendency.
  • Figure 5: Distributions of robustness scores achieved by the LLMs (by weather condition) for answering RQ3.a and RQ3.c. The mean represents the central tendency.