
Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach

Weichao Xu, Huaxin Pei, Jingxuan Yang, Yuchen Shi, Yi Zhang, Qianchuan Zhao

TL;DR

The paper presents LLMTester, an online testing framework that uses a generate-test-feedback loop to uncover critical and diverse failure scenarios for decision-making policies. It couples an LLM-based scenario generator with a four-module framework (scenario database, generator, testing, and evaluation) and a multi-scale strategy that combines large-scale LLM mutations with small random edits guided by scenario potential. Across five policies and four environments, LLMTester discovers more failures and richer scenario diversity than baselines, and its efficiency is improved further by adaptive thresholding and potential analysis. The approach is robust to different LLMs and demonstrates practical utility for testing autonomous systems and robotics under realistic, complex settings.

Abstract

Recent advances in decision-making policies have led to significant progress in fields such as autonomous driving and robotics. However, testing these policies remains crucial because critical scenarios can threaten their reliability. Despite ongoing research, low testing efficiency and limited scenario diversity persist because the decision-making policies and their environments are complex. To address these challenges, this paper proposes an adaptable Large Language Model (LLM)-driven online testing framework to explore critical and diverse testing scenarios for decision-making policies. Specifically, we design a "generate-test-feedback" pipeline with templated prompt engineering to harness the world knowledge and reasoning abilities of LLMs. Additionally, we propose a multi-scale scenario generation strategy to compensate for the limited ability of LLMs to make fine-grained adjustments, further enhancing testing efficiency. Finally, the proposed LLM-driven method is evaluated on five widely recognized benchmarks, and the experimental results demonstrate that our method significantly outperforms baseline methods in uncovering both critical and diverse scenarios. These findings suggest that LLM-driven methods hold significant promise for advancing the testing of decision-making policies.
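The "generate-test-feedback" pipeline and the multi-scale strategy can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: `llm_mutate` is a random stand-in for a real LLM call, `run_policy` is a toy criticality score, the thresholds are arbitrary, and the rule "apply small edits once a seed's criticality is already high" is only one plausible reading of "guided by scenario potential".

```python
import random
from dataclasses import dataclass


@dataclass
class Scenario:
    params: dict[str, float]      # hypothetical scenario parameters
    criticality: float = 0.0      # filled in after testing


def llm_mutate(seed: Scenario, feedback: str, rng: random.Random) -> Scenario:
    """Stand-in for an LLM call: a large-scale change to every parameter."""
    return Scenario({k: v + rng.uniform(-1.0, 1.0) for k, v in seed.params.items()})


def random_edit(seed: Scenario, rng: random.Random, scale: float = 0.05) -> Scenario:
    """Small-scale edit: perturb a single parameter by a small amount."""
    key = rng.choice(sorted(seed.params))
    params = dict(seed.params)
    params[key] += rng.uniform(-scale, scale)
    return Scenario(params)


def run_policy(scenario: Scenario) -> float:
    """Toy test + evaluation: criticality in [0, 1], higher is more critical."""
    return max(0.0, 1.0 - abs(scenario.params["gap"]))


def explore(seed, iterations=50, potential_threshold=0.6,
            failure_threshold=0.9, rng_seed=0):
    rng = random.Random(rng_seed)
    seed.criticality = run_policy(seed)
    database, failures, feedback = [seed], [], ""
    for _ in range(iterations):
        # Pick the most critical known scenario as the seed.
        parent = max(database, key=lambda s: s.criticality)
        # Assumed rule: promising seeds get fine-grained random edits,
        # the rest get a large-scale (LLM-style) mutation.
        if parent.criticality >= potential_threshold:
            child = random_edit(parent, rng)
        else:
            child = llm_mutate(parent, feedback, rng)
        child.criticality = run_policy(child)                   # test + evaluate
        feedback = f"last criticality={child.criticality:.2f}"  # feed back
        database.append(child)
        if child.criticality >= failure_threshold:
            failures.append(child)
    return database, failures
```

Calling `explore(Scenario({"gap": 2.0, "speed": 10.0}))` returns every scenario tried together with the subset whose toy criticality exceeds the failure threshold. In a real setup, `run_policy` would execute the policy under test in its simulator.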

Paper Structure

This paper contains 34 sections, 6 equations, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: An overview of the LLM-driven online testing framework.
  • Figure 2: The workflow of our LLM-driven online testing framework. Scenario Database provides the seed scenario as a reference for generating new scenarios. Scenario Generator uses prompt engineering and a multi-scale generation strategy to exploit the LLM's capabilities and generate critical scenarios efficiently. Scenario Testing tests the policy in the given scenarios. Scenario Evaluation assesses criticality and diversity based on the testing results, providing feedback to the LLM for self-improvement.
  • Figure 3: Key elements for designing a prompt.
  • Figure 4: The Prompt Template. Instruction (Role) provides an overview of the LLM's task and the target environment. Scenario Information (##Scenario Information) outlines the common, unchanging information about the target environment. Input Message (##Input) includes the information that varies during testing, such as the seed scenario, feedback, and expert experience. Scenario Generation (##Output) guides the LLM to generate and output a new scenario in a specified format.
  • Figure 5: The illustration of multi-scale generation strategy.
  • ...and 8 more figures
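
The four-part prompt template described in the Figure 4 caption can be sketched as a small Python helper. This is an illustrative assumption, not the paper's template: the function name `build_prompt`, its parameters, the `Role:` prefix, and the wording inside each section are hypothetical. Only the section headers `##Scenario Information`, `##Input`, and `##Output` come from the caption.

```python
import json


def build_prompt(role: str, scenario_info: str, seed_scenario: dict,
                 feedback: str, expert_experience: str, output_format: str) -> str:
    """Assemble a prompt with the four sections named in the Figure 4 caption."""
    # Input Message: the parts that change on every testing iteration.
    input_block = "\n".join([
        f"Seed scenario: {json.dumps(seed_scenario, sort_keys=True)}",
        f"Feedback: {feedback}",
        f"Expert experience: {expert_experience}",
    ])
    sections = [
        f"Role: {role}",                             # Instruction (Role)
        f"##Scenario Information\n{scenario_info}",  # fixed environment facts
        f"##Input\n{input_block}",                   # variable information
        f"##Output\n{output_format}",                # required output format
    ]
    return "\n\n".join(sections)
```

Keeping the fixed sections (role, scenario information, output format) separate from the variable input means only the `##Input` block has to be rebuilt on each iteration of the testing loop.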