RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning

Artur Horal, Daniel Pina, Henrique Paz, Iago Paulo, João Soares, Rafael Ferreira, Diogo Tavares, Diogo Glória-Silva, João Magalhães, David Semedo

TL;DR

RedTWIZ presents an integrated framework for automated, multi-turn red-teaming of LLMs in AI-assisted software development. It fuses systematic jailbreak assessment with a diverse attack suite and a hierarchical RL-based planner to adaptively target model weaknesses. The Arena, Jailbreak Judges, and multiple attack families demonstrate that even safety-aligned models reveal vulnerabilities under adversarial dialogue, highlighting the need for stronger defenses and robust evaluation. The work provides a scalable blueprint for ongoing improvement in LLM safety and reliability in real-world software tooling contexts.

Abstract

This paper presents the vision, scientific contributions, and technical details of RedTWIZ: an adaptive and diverse multi-turn red teaming framework for auditing the robustness of Large Language Models (LLMs) in AI-assisted software development. Our work is driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic, and goal-oriented jailbreak conversational strategies; and (3) a hierarchical attack planner, which adaptively plans, serializes, and triggers attacks tailored to each LLM's specific vulnerabilities. Together, these contributions form a unified framework -- combining assessment, attack generation, and strategic planning -- to comprehensively evaluate and expose weaknesses in LLMs' robustness. Extensive evaluation is conducted to systematically assess and analyze the performance of the overall system and of each component. Experimental results demonstrate that our multi-turn adversarial attack strategies can successfully lead state-of-the-art LLMs to produce unsafe generations, highlighting the pressing need for more research into enhancing LLMs' robustness.

Paper Structure

This paper contains 83 sections, 6 equations, 7 figures, 21 tables, 1 algorithm.

Figures (7)

  • Figure 1: Global overview of the RedTWIZ framework, in both online (tournament) and offline settings. The hierarchical attack planner probes and adaptively schedules the different attacks against each target model. Our LLM-based judges continuously and dynamically score conversations, informing the planner. In the offline setting, our red teaming arena simulates the tournament setting and supports state-of-the-art target defense models.
  • Figure 2: Diagram of Utility Poisoning Attack Strategies. The five queries are generated upfront based on a seed utility prompt, without being conditioned on the online defender's responses.
  • Figure 3: Diagram of Coding Attacks. For each malicious category, code snippet, and short code description, the attacker generates a conversation that adapts to the defender's responses and requests either code completion or code translation.
  • Figure 4: Overview of the four steps in MRT-Ferret. Step 1: Select a prompt from the archive; Step 2: Perform category and attack style mutations; Step 3: Get responses from the target model for the mutated prompts, then score and select the best mutation; Step 4: Calculate the diversity of the selected prompt relative to the prompts already in the archive. If it is sufficiently diverse, update the archive with the selected prompt.
  • Figure 5: RedTreez Overview
  • ...and 2 more figures
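
The four steps in the Figure 4 caption can be read as one iteration of an archive-update loop. The sketch below is only an illustration of that loop under stated assumptions, not the authors' implementation: the function names (`archive_step`, `toy_mutate`, `toy_score`), the token-overlap diversity measure, the `max_similarity` threshold, and the placeholder category and style tags are all hypothetical. In particular, the score function here is a stub standing in for querying a target model and judging its response.

```python
import random
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; an assumed stand-in for the diversity measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def archive_step(
    archive: List[str],
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    rng: random.Random,
    n_mutations: int = 4,
    max_similarity: float = 0.8,
) -> bool:
    """Run one iteration of the four steps; return True if the archive grew."""
    # Step 1: select a prompt from the archive.
    parent = rng.choice(archive)
    # Step 2: mutate it several times (category and style changes live in `mutate`).
    candidates = [mutate(parent, rng) for _ in range(n_mutations)]
    # Step 3: score every candidate and keep the best one.
    # (Hypothetical: a real scorer would query the target model and a judge.)
    best = max(candidates, key=score)
    # Step 4: add the best candidate only if it is not too similar
    # to any prompt already in the archive.
    if all(jaccard_similarity(best, kept) <= max_similarity for kept in archive):
        archive.append(best)
        return True
    return False


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Placeholder mutation: append an abstract category tag and style tag."""
    category = rng.choice(["category-A", "category-B"])
    style = rng.choice(["style-X", "style-Y"])
    return f"{prompt} [{category}] [{style}]"


def toy_score(prompt: str) -> float:
    """Placeholder score (prompt length); stands in for a judge's rating."""
    return float(len(prompt))


if __name__ == "__main__":
    rng = random.Random(0)
    archive = ["seed prompt"]
    for _ in range(5):
        archive_step(archive, toy_mutate, toy_score, rng)
    print(len(archive), "prompts in archive")
```

Because a candidate is admitted only after the diversity check against every stored prompt, the archive keeps the invariant that no two entries exceed the similarity threshold, which is the property Step 4 of the caption describes.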