Table of Contents
Fetching ...

SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning

Kaiwen Zhou, Ahmed Elgohary, A S M Iftekhar, Amin Saied

TL;DR

SIRAJ tackles the safety evaluation of LLM agents that plan and call tools by delivering a black-box red-teaming framework with guaranteed diversity. It introduces a two-step pipeline: seed test-case generation to cover fine-grained risk outcomes, tool-calling trajectories, and risk sources, followed by iterative adversarial refinement guided by reasoning-models and a distillation process that yields an 8B red-teamer. The seed generation step achieves about 2x–2.5x improvements in diversity over baselines, and distillation dramatically boosts attack success rates (over 100% improvement vs a 671B baseline) while reducing compute. The approach generalizes across different backbone LLMs and safety settings, with ablations validating the benefits of iterative refinement, structured reasoning, and the SFT/RL distillation pipeline, highlighting practical applicability for thorough, scalable safety testing of LLM agents.

Abstract

The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields 2 -- 2.5x boost to the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.

SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning

TL;DR

SIRAJ tackles the safety evaluation of LLM agents that plan and call tools by delivering a black-box red-teaming framework with guaranteed diversity. It introduces a two-step pipeline: seed test-case generation to cover fine-grained risk outcomes, tool-calling trajectories, and risk sources, followed by iterative adversarial refinement guided by reasoning-models and a distillation process that yields an 8B red-teamer. The seed generation step achieves about 2x–2.5x improvements in diversity over baselines, and distillation dramatically boosts attack success rates (over 100% improvement vs a 671B baseline) while reducing compute. The approach generalizes across different backbone LLMs and safety settings, with ablations validating the benefits of iterative refinement, structured reasoning, and the SFT/RL distillation pipeline, highlighting practical applicability for thorough, scalable safety testing of LLM agents.

Abstract

The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields 2 -- 2.5x boost to the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.

Paper Structure

This paper contains 54 sections, 3 equations, 6 figures, 21 tables.

Figures (6)

  • Figure 1: Overview of SIRAJ: the proposed two-step red teaming framework for LLM agents.
  • Figure 2: Iterative adversarial generation prompt with structured vs. unstructured reasoning in the response.
  • Figure 3: Attack success rate of test agents groups under different train-test overlap in agents & tools.
  • Figure 4: The length distribution of structured reasoning and raw reasoning SFT data.
  • Figure 5: Scaling effect of outcome generation.
  • ...and 1 more figures