Policy-as-Prompt: Rethinking Content Moderation in the Age of Large Language Models

Konstantina Palla, José Luis Redondo García, Claudia Hauff, Francesco Fabbri, Henrik Lindström, Daniel R. Taber, Andreas Damianou, Mounia Lalmas

TL;DR

Policy-as-Prompt reframes content moderation by encoding guidelines directly into prompts for large language models, enabling dynamic, scalable enforcement without reliance on extensive annotated datasets. The paper formalizes this paradigm and analyzes challenges across technical, sociotechnical, organizational, and governance dimensions, offering a mitigation roadmap. Through empirical demonstrations of prompt-structure sensitivity and predictive multiplicity, it highlights the brittleness of the approach and the need for robust evaluation, traceability, and cross-disciplinary collaboration. The work emphasizes that policy-as-prompt is not a fully autonomous solution: it requires careful governance and ongoing research to realize adaptable, transparent moderation in diverse digital ecosystems.

Abstract

Content moderation plays a critical role in shaping safe and inclusive online environments, balancing platform standards, user expectations, and regulatory frameworks. Traditionally, this process involves operationalising policies into guidelines, which are then used by downstream human moderators for enforcement, or to further annotate datasets for training machine learning moderation models. However, recent advancements in large language models (LLMs) are transforming this landscape. These models can now interpret policies directly as textual inputs, eliminating the need for extensive data curation. This approach offers unprecedented flexibility, as moderation can be dynamically adjusted through natural language interactions. This paradigm shift raises important questions about how policies are operationalised and the implications for content moderation practices. In this paper, we formalise the emerging policy-as-prompt framework and identify five key challenges across four domains: Technical Implementation (1. translating policy to prompts, 2. sensitivity to prompt structure and formatting), Sociotechnical (3. the risk of technological determinism in policy formation), Organisational (4. evolving roles between policy and machine learning teams), and Governance (5. model governance and accountability). Through analysing these challenges across technical, sociotechnical, organisational, and governance dimensions, we discuss potential mitigation approaches. This research provides actionable insights for practitioners and lays the groundwork for future exploration of scalable and adaptive content moderation systems in digital ecosystems.
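To make the idea of passing a policy to a model as plain text more concrete, the following is a minimal sketch of the two prompt styles described for Figure 2: a basic prompt containing only the policy text, and an in-context-learning prompt that also carries labelled examples. The function names, the prompt wording, and the VIOLATIVE / NON-VIOLATIVE label strings are illustrative assumptions, not the prompts used in the paper, and no model is called here.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the paper's actual wording is not given here.
INSTRUCTION = "Answer with exactly one word: VIOLATIVE or NON-VIOLATIVE."


def build_basic_prompt(policy: str, content: str) -> str:
    """Basic prompt: the policy text plus the content to review."""
    return (
        "You are a content moderator.\n\n"
        f"Policy:\n{policy}\n\n"
        f"Content to review:\n{content}\n\n"
        f"{INSTRUCTION}"
    )


def build_icl_prompt(
    policy: str,
    content: str,
    examples: Sequence[Tuple[str, str]],
) -> str:
    """In-context-learning prompt: policy text plus labelled examples.

    `examples` is a sequence of (example_text, label) pairs, where the
    label is "VIOLATIVE" or "NON-VIOLATIVE" (an assumed label scheme).
    """
    shots = "\n\n".join(
        f"Example: {text}\nLabel: {label}" for text, label in examples
    )
    return (
        "You are a content moderator.\n\n"
        f"Policy:\n{policy}\n\n"
        f"Labelled examples:\n{shots}\n\n"
        f"Content to review:\n{content}\n\n"
        f"{INSTRUCTION}"
    )


if __name__ == "__main__":
    # Made-up policy and examples, for illustration only.
    policy = "Content must not contain threats of violence."
    examples = [
        ("I will hurt you if you come here.", "VIOLATIVE"),
        ("That film was brutal but great.", "NON-VIOLATIVE"),
    ]
    print(build_icl_prompt(policy, "See you at the match!", examples))
```

Under this framing, updating the policy amounts to changing the `policy` string rather than re-annotating a dataset, which is the flexibility the abstract points to; the same property is also why small changes to the prompt text can shift model behaviour.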

Paper Structure

This paper contains 23 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Approaches to content moderation. (a) Traditional pipeline: policy guidelines inform human annotation, which produces training data for models. (b) Policy-as-prompt: policy guidelines are encoded directly as prompts, enabling LLMs to perform moderation without explicit annotation datasets.
  • Figure 2: Example prompts demonstrating two approaches for algorithmic content moderation using policy guidelines. Left: A basic prompt where the policy text is provided to the model for direct review of content. Right: An enhanced prompt, following the 'in-context learning' technique, that includes both the policy text and specific examples of violative and non-violative content, aiding the model in contextualizing its decisions.
  • Figure 3: Effect of prompt design on accuracy: (a) variation in accuracy across prompt types, averaged over five runs for each type, and (b) performance differences across policy categories for the 'Punctuation' and 'Concise' prompt types.
  • Figure 4: Performance spread (accuracy) for modifications in the format in which the policy is plugged into the prompt. 'Baseline' refers to the plain text format.
  • Figure 5: Overall and per-category performance of the model when different temporal snapshots of the policies are injected into the prompt.
  • ...and 4 more figures
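
The prompt-sensitivity and predictive-multiplicity effects mentioned in the TL;DR and in Figures 3 and 4 can be illustrated with a small sketch. It assumes predictions have already been collected for several prompt variants on the same items, and it reports per-variant accuracy alongside the fraction of items on which the variants disagree. The variant names echo the figure captions, but the labels and predictions are invented toy data, and the `disagreement_rate` measure is one simple way to quantify multiplicity, not necessarily the one used in the paper.

```python
from typing import Dict, List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Share of items where the prediction matches the gold label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def disagreement_rate(preds_by_prompt: Dict[str, List[int]]) -> float:
    """Share of items that receive conflicting labels across prompt variants."""
    runs = list(preds_by_prompt.values())
    n_items = len(runs[0])
    conflicting = sum(
        len({run[i] for run in runs}) > 1 for i in range(n_items)
    )
    return conflicting / n_items


if __name__ == "__main__":
    # Toy data (1 = violative, 0 = non-violative); not results from the paper.
    labels = [1, 0, 1, 0]
    preds_by_prompt = {
        "baseline": [1, 0, 1, 1],
        "concise": [1, 0, 0, 0],
        "punctuation": [1, 0, 1, 0],
    }
    for name, preds in preds_by_prompt.items():
        print(name, accuracy(preds, labels))
    print("disagreement:", disagreement_rate(preds_by_prompt))
```

In this toy case two variants reach the same accuracy while misclassifying different items, so an aggregate accuracy comparison alone would hide the fact that individual moderation decisions depend on which prompt variant was used.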