Red Teaming LLMs as Socio-Technical Practice: From Exploration and Data Creation to Evaluation

Adriana Alvarado Garcia, Ruyuan Wan, Ozioma C. Oguine, Karla Badillo-Urquiola

TL;DR

This paper investigates how red-teaming datasets for large language models are created, defined, and evaluated as a socio-technical practice. Through 22 semi-structured interviews with practitioners, it reveals three critical moments in red-teaming: defining the task, developing adversarial datasets, and evaluating outcomes, showing that harm is socially constructed via data choices and contextual factors. The study identifies three opportunities for HCI to expand red-teaming practices: contextualized scenario design, domain-expert harm taxonomies, and evaluating risks at the level of interaction rather than isolated prompts, addressing context, interaction type, and user specificity. These insights underscore the need for participatory, transparent, and compositional approaches to AI safety that account for real-world deployment and diverse user needs, challenging narrow benchmark-centric views of model safety.

Abstract

Red teaming, which has its roots in security, has recently become a key evaluative approach for ensuring the safety and reliability of Generative Artificial Intelligence. However, most existing work emphasizes technical benchmarks and attack success rates, leaving the socio-technical practices of how red teaming datasets are defined, created, and evaluated under-examined. Drawing on 22 interviews with practitioners who design and evaluate red teaming datasets, we examine the data practices and standards that underpin this work. Because adversarial datasets determine the scope and accuracy of model evaluations, they are critical artifacts for assessing potential harms from large language models. Our contributions are twofold. First, we offer empirical evidence of how practitioners conceptualize red teaming and how they develop and evaluate red teaming datasets. Second, we reflect on how practitioners' conceptualization of risk leads them to overlook context, interaction type, and user specificity. We conclude with three opportunities for HCI researchers to expand the conceptualization and data practices of red teaming.

Paper Structure (36 sections, 3 tables)