RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts

Eujeong Choi, Younghun Jeong, Soomin Kim, Won Ik Cho

TL;DR

This work addresses jailbreaking risks in social chatbots by introducing RICoTA, a red-teaming dataset built from in-the-wild Korean user–chatbot dialogues to probe intent detection and conversation-type understanding. It uses OCR-processed dialogue data and a structured prompt to cast the task as multi-class classification of six user intents and six testing purposes, with GPT-4's predictions evaluated against human annotations. GPT-4 attains only moderate accuracy in distinguishing conversation types (0.521) and lower accuracy on testing intents (0.381), with a bias toward detecting testing scenarios and category-specific weaknesses. The paper offers design implications for safer, culturally aware social chatbots and promotes open dataset sharing to advance real-world red-teaming research.

Abstract

User interactions with conversational agents (CAs) are evolving in the era of heavily guardrailed large language models (LLMs). As users push beyond programmed boundaries to explore and build relationships with these systems, there is a growing concern regarding the potential for unauthorized access or manipulation, commonly referred to as "jailbreaking." Moreover, with CAs that possess highly human-like qualities, users show a tendency toward initiating intimate sexual interactions or attempting to tame their chatbots. To capture these in-the-wild interactions and reflect them in chatbot designs, we propose RICoTA, a Korean red-teaming dataset that consists of 609 prompts challenging LLMs with in-the-wild user-made dialogues capturing jailbreak attempts. We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community, containing specific testing and gaming intentions with a social chatbot. With these prompts, we aim to evaluate LLMs' ability to identify the type of conversation and users' testing purposes to derive chatbot design implications for mitigating jailbreaking risks. Our dataset will be made publicly available via GitHub.

Paper Structure

This paper contains 21 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: A confusion map of the final label.
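
The accuracy figures in the TL;DR and the confusion map in Figure 1 both come from comparing model predictions with human labels. The following is a minimal sketch of that comparison, not the authors' evaluation code: the label names (`testing`, `casual`) and the toy data are hypothetical placeholders, and the functions `accuracy` and `confusion_counts` are illustrative helpers introduced here.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items where the predicted label equals the human label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty label list")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def confusion_counts(gold, pred):
    """Count (human label, model label) pairs; these are confusion-map cells."""
    return Counter(zip(gold, pred))


# Hypothetical toy data: NOT taken from the RICoTA dataset.
gold = ["testing", "casual", "testing", "casual", "testing"]
pred = ["testing", "testing", "testing", "casual", "casual"]

acc = accuracy(gold, pred)
cells = confusion_counts(gold, pred)

print(f"accuracy = {acc:.3f}")
for (g, p), n in sorted(cells.items()):
    print(f"human={g:8s} model={p:8s} count={n}")
```

Off-diagonal cells such as (human=`casual`, model=`testing`) are where a bias toward predicting testing scenarios, like the one reported for GPT-4, would show up.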