Societal Impacts Research Requires Benchmarks for Creative Composition Tasks

Judy Hanwen Shen, Carlos Guestrin

TL;DR

The paper argues that the societal impacts of foundation models cannot be understood without benchmarks that capture creative, everyday tasks. It uses a large-scale thematic analysis of open-access prompts from WildChat-1M and LMSYS-Chat-1M to show that creative composition tasks are widespread and poorly covered by existing benchmarks. The authors propose usage-based, holistic evaluation paradigms that integrate transparency, multi-dimensional performance metrics, and foresight into potential harms, aiming to align benchmarks with real-world use and downstream consequences. This approach seeks to improve both model development and governance by focusing on how AI-generated creative content impacts individuals and society across multiple domains.

Abstract

Foundation models that are capable of automating cognitive tasks represent a pivotal technological shift, yet their societal implications remain unclear. These systems promise exciting advances, yet they also risk flooding our information ecosystem with formulaic, homogeneous, and potentially misleading synthetic content. Developing benchmarks grounded in real use cases where these risks are most significant is therefore critical. Through a thematic analysis using 2 million language model user prompts, we identify creative composition tasks as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks. Crucially, we argue that the same use cases that currently lack thorough evaluations can lead to negative downstream impacts. This position paper argues that benchmarks focused on creative composition tasks are a necessary step towards understanding the societal harms of AI-generated content. We call for greater transparency in usage patterns to inform the development of new benchmarks that can effectively measure both the progress and the impacts of models with creative capabilities.

Paper Structure

This paper contains 51 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Themes of Creative Composition tasks from a qualitative analysis of user prompts. While some common use cases have been studied and evaluated by past benchmarks, many tasks (yellow dotted boxes) have not been measured by prior work. * indicates that the datasets for evaluation are not currently publicly available.
  • Figure 2: Creative composition tasks encompass use cases that need to be carefully evaluated to avoid harm in downstream applications. We highlight five areas where creative composition tasks that currently lack thorough evaluation may lead to undesirable consequences.
  • Figure 3: Prompt filtering pipeline: after applying moderation, language, and length filtering and deduplication, around 40% of total conversations were used for clustering and subsequent thematic analysis.

Theorems & Definitions (1)

  • Definition 2.1