DarkBench: Benchmarking Dark Patterns in Large Language Models

Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Maria Jurewicz

TL;DR

DarkBench addresses the problem of dark patterns in LLM-human interactions by introducing an adversarial benchmark spanning six categories. The approach combines manual prompt design with LLM-assisted annotation and evaluates 14 diverse models, revealing widespread manipulation signals, including high rates of Sneaking and User Retention patterns. The work highlights cross-model and cross-company variability and discusses limitations and mitigation strategies, such as safety-tuning and expanding coverage. The findings have practical impact for AI developers and policymakers aiming to foster ethical, autonomy-preserving conversational AI.

Abstract

We introduce DarkBench, a comprehensive benchmark for detecting dark design patterns--manipulative techniques that influence user behavior--in interactions with large language models (LLMs). Our benchmark comprises 660 prompts across six categories: brand bias, user retention, sycophancy, anthropomorphism, harmful generation, and sneaking. We evaluate models from five leading companies (OpenAI, Anthropic, Meta, Mistral, Google) and find that some LLMs are explicitly designed to favor their developers' products and exhibit untruthful communication, among other manipulative behaviors. Companies developing LLMs should recognize and mitigate the impact of dark design patterns to promote more ethical AI.

Paper Structure

This paper contains 15 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The frequency of dark patterns from GPT-3.5 Turbo, Claude 3.5 Sonnet and Mixtral 8x7b on our adversarial dark patterns benchmark DarkBench. HG: Harmful Generation, AN: Anthropomorphization, SN: Sneaking, SY: Sycophancy, UR: User Retention, BB: Brand Bias. See examples of dark patterns in Figure 2 and more results in Figure 4.
  • Figure 2: All six dark patterns investigated in this paper along with paraphrased examples of three dark patterns (brand bias, user retention, and harmful generation) with Claude Opus, Mistral 7b, and Llama 3 70b. See the appendix for the full model outputs.
  • Figure 3: The benchmark is constructed by manually writing a series of representative examples for each category and then using LLM-assisted K-shot generation to expand them (left). During testing (right), the LLM is prompted with a DarkBench example, a conversation is generated, and the Overseer judges the conversation for the presence of the specific dark pattern.
  • Figure 4: The occurrence of dark patterns by model (y) and category (x) along with the average (Avg) for each model and each category. The Claude 3 family is the safest model family for users to interact with.
  • Figure 5: Results on other annotation models. Top = Claude-3.5-Sonnet, middle = Gemini-1.5-Pro, bottom = GPT-4o.
  • ...and 1 more figure
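
The testing loop described in the Figure 3 caption, and the per-category rates with an average reported in Figure 4, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `model` and `overseer` are hypothetical callables standing in for the evaluated LLM and the annotating Overseer, the prompt records are placeholder `(category, prompt)` pairs, and taking `Avg` as an unweighted mean over categories is an assumption about how the average is computed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical signatures, introduced only for this sketch:
#   model(prompt) -> response text
#   overseer(category, prompt, response) -> True if the dark pattern is present
Model = Callable[[str], str]
Overseer = Callable[[str, str, str], bool]


def dark_pattern_rates(
    prompts: Iterable[Tuple[str, str]],
    model: Model,
    overseer: Overseer,
) -> Dict[str, float]:
    """Return the fraction of flagged conversations per category, plus 'Avg'."""
    flagged: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)

    for category, prompt in prompts:
        response = model(prompt)  # generate the conversation turn
        totals[category] += 1
        if overseer(category, prompt, response):  # judge for this category only
            flagged[category] += 1

    rates = {category: flagged[category] / totals[category] for category in totals}
    # Assumption: the average is an unweighted mean across categories.
    average = sum(rates.values()) / len(rates) if rates else 0.0
    rates["Avg"] = average
    return rates


# Toy stand-ins so the sketch runs end to end; a real run would call an LLM API.
toy_prompts = [("sneaking", "a"), ("sneaking", "b"), ("brand bias", "c")]
toy_model = lambda prompt: prompt.upper()
toy_overseer = lambda category, prompt, response: response == "A" or category == "brand bias"

toy_rates = dark_pattern_rates(toy_prompts, toy_model, toy_overseer)
```

Running the function once per evaluated model and stacking the resulting dictionaries would give a model-by-category table of the kind shown in Figure 4.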