Intelligent Resilience Testing for Decision-Making Agents with Dual-Mode Surrogate Adaptation

Jingxuan Yang, Weichao Xu, Yuchen Shi, Yi Zhang, Shuo Feng, Huaxin Pei

TL;DR

IRTest tackles the surrogate-to-real gap in testing decision-making agents by combining offline-trained surrogate models with online adaptive corrections. It offers two operational modes: in data-rich regimes, it fine-tunes a neural surrogate prediction model (SPM) online; in data-limited regimes, it reweights a mixture of SPMs using importance sampling (IS). Bayesian optimization guides scenario generation. Experiments across three heterogeneous environments (PredatorPrey, CoopNavi, and BipedalWalker) show improved failure-discovery efficiency, robustness, and cross-system generalizability, with gains attributed to both adaptation strategies and the IS techniques. The framework shows practical potential for scalable, adaptive testing in complex, high-dimensional agent scenarios.
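To make the data-rich mode concrete, here is a minimal sketch of online fine-tuning, not the paper's implementation. It assumes a tiny logistic model as a stand-in for the neural surrogate, and the names `predict`, `fine_tune_step`, and `log_loss`, the initial weights, and the toy test results are all hypothetical. The idea it illustrates is only that each real test outcome is used to update the offline-trained surrogate.

```python
import math

def predict(w, b, x):
    """Surrogate failure probability for scenario features x (logistic stand-in)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune_step(w, b, x, y, lr=0.5):
    """One SGD step on the log loss for a single observed test result (x, y)."""
    err = predict(w, b, x) - y  # gradient of the log loss w.r.t. the logit
    new_w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return new_w, b - lr * err

def log_loss(w, b, data):
    """Mean log loss of the surrogate on observed (scenario, failure) pairs."""
    tiny = 1e-12  # clamp to keep log() finite
    total = 0.0
    for x, y in data:
        p = min(max(predict(w, b, x), tiny), 1.0 - tiny)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

# Assumption: the offline surrogate believes failures grow with feature 0 ...
w, b = [2.0, 0.0], -1.0
# ... while the (hypothetical) agent under test actually fails when feature 1 is large.
observed = [([0.1, 0.9], 1), ([0.9, 0.1], 0), ([0.2, 0.8], 1), ([0.8, 0.2], 0)]

loss_before = log_loss(w, b, observed)
for _ in range(50):               # replay the observed tests a few times
    for x, y in observed:
        w, b = fine_tune_step(w, b, x, y)
loss_after = log_loss(w, b, observed)
print(loss_before, loss_after)
```

In this toy setting the surrogate's loss on the observed tests drops as it adapts to the agent's real behavior; a neural SPM would follow the same loop with a richer model and optimizer.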

Abstract

Testing and evaluating decision-making agents remains challenging due to unknown system architectures, limited access to internal states, and the vastness of high-dimensional scenario spaces. Existing testing approaches often rely on surrogate models of decision-making agents to generate large-scale scenario libraries; however, discrepancies between surrogate models and real decision-making agents significantly limit their generalizability and practical applicability. To address this challenge, this paper proposes intelligent resilience testing (IRTest), a unified online adaptive testing framework designed to rapidly adjust to diverse decision-making agents. IRTest initializes with an offline-trained surrogate prediction model and progressively reduces the surrogate-to-real gap during testing through two complementary adaptation mechanisms: (i) online neural fine-tuning in data-rich regimes, and (ii) lightweight importance-sampling-based weighting correction in data-limited regimes. A Bayesian optimization strategy, equipped with bias-corrected acquisition functions, guides scenario generation to balance exploration and exploitation in complex testing spaces. Extensive experiments across varying levels of task complexity and system heterogeneity demonstrate that IRTest consistently improves failure-discovery efficiency, testing robustness, and cross-system generalizability. These results highlight the potential of IRTest as a practical solution for scalable, adaptive, and resilient testing of decision-making agents.
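The importance-sampling correction in mechanism (ii) can be sketched on a toy discrete scenario space. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the two surrogates, the combination coefficients `coeffs`, the defensive mixing weight `eps`, and the helper names are all hypothetical. It shows how a surrogate-weighted proposal can concentrate tests on likely failures while the likelihood ratio keeps the failure-rate estimate unbiased.

```python
import random

def mixture_prediction(spm_preds, coeffs):
    """Combine per-scenario failure predictions from several surrogates."""
    n = len(spm_preds[0])
    return [sum(c * p[i] for c, p in zip(coeffs, spm_preds)) for i in range(n)]

def proposal(prior, surrogate, eps):
    """Defensive mixture: eps * prior + (1 - eps) * surrogate-weighted prior."""
    raw = [p * s for p, s in zip(prior, surrogate)]
    z = sum(raw)
    return [eps * p + (1 - eps) * r / z for p, r in zip(prior, raw)]

def is_estimate(prior, q, agent_fails, n, rng):
    """Importance-sampling estimate of the failure rate under the prior."""
    idx = rng.choices(range(len(prior)), weights=q, k=n)
    return sum(prior[i] / q[i] * agent_fails(i) for i in idx) / n

# Toy setup (all values are assumptions for illustration).
num = 100
prior = [1.0 / num] * num                                      # naturalistic scenario distribution
spm_a = [0.9 if i >= 85 else 0.05 for i in range(num)]         # roughly right surrogate
spm_b = [0.9 if i < 15 else 0.05 for i in range(num)]          # badly mismatched surrogate
agent_fails = lambda i: 1 if i >= 90 else 0                    # real agent: true failure rate 0.1

surrogate = mixture_prediction([spm_a, spm_b], coeffs=[0.8, 0.2])
q = proposal(prior, surrogate, eps=0.2)
estimate = is_estimate(prior, q, agent_fails, n=2000, rng=random.Random(0))
print(estimate)
```

The `eps` term keeps every scenario reachable, which bounds the importance weights by `1 / eps` even where the surrogates are wrong; shifting `coeffs` toward the better-matched surrogate reduces the variance of the estimate.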

Paper Structure

This paper contains 22 sections, 12 equations, 7 figures, 3 tables, and 2 algorithms.

Figures (7)

  • Figure 1: The workflow of IRTest.
  • Figure 2: Illustration of the IRTest framework.
  • Figure 3: Illustration of the PredatorPrey, CoopNavi, and BipedalWalker environments.
  • Figure 4: IRTest-R results: precision at a recall of 0.5, average precision (AP), precision, and recall across the three environments, where $\epsilon=0.2$ and classification threshold $f_{\mathrm{th}} = 0.5$.
  • Figure 5: IRTest-L results: AP and combination coefficients across the three environments, where $\epsilon=0.05$ and classification threshold $f_{\mathrm{th}} = 0.5$.
  • ...and 2 more figures
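The metrics named in the captions of Figures 4 and 5 can be computed as in the sketch below. It assumes `scores` are surrogate failure predictions and `labels` are the failures observed on the real agent, and it uses one common convention for each metric (non-interpolated AP, and the best precision among ranked cutoffs whose recall reaches the target); the paper's exact definitions may differ. The sample data is made up for illustration.

```python
def precision_recall_at_threshold(scores, labels, f_th=0.5):
    """Precision and recall when scenarios with score >= f_th are flagged as failures."""
    pred = [s >= f_th for s in scores]
    tp = sum(1 for p, y in zip(pred, labels) if p and y)
    fp = sum(1 for p, y in zip(pred, labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pr_curve(scores, labels):
    """(precision, recall, is_hit) after each ranked item, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp, points = 0, []
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        points.append((tp / rank, tp / total_pos, labels[i]))
    return points

def average_precision(scores, labels):
    """Mean of the precision values at the ranks where a true failure appears."""
    return sum(p for p, _, hit in pr_curve(scores, labels) if hit) / sum(labels)

def precision_at_recall(scores, labels, target=0.5):
    """Best precision among ranked cutoffs whose recall reaches the target."""
    return max(p for p, r, _ in pr_curve(scores, labels) if r >= target)

# Made-up predictions and outcomes for six scenarios.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 1, 0, 0]

precision, recall = precision_recall_at_threshold(scores, labels, f_th=0.5)
ap = average_precision(scores, labels)
p_at_r = precision_at_recall(scores, labels, target=0.5)
print(precision, recall, ap, p_at_r)
```

Threshold-based precision and recall depend on the choice of $f_{\mathrm{th}}$, whereas AP and precision at a fixed recall summarize the ranking quality of the predictions independently of any single threshold.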