SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset

Josef Dai, Tianle Chen, Xuyao Wang, Ziran Yang, Taiye Chen, Jiaming Ji, Yaodong Yang

TL;DR

SafeSora introduces a real human feedback dataset for text-to-video generation, explicitly modeling helpfulness and harmlessness to study human-value alignment in T-V tasks. The work provides a two-stage annotation protocol, a rich harm taxonomy, and a diverse prompt/video corpus to support alignment research, including T-V moderation and reward/cost-based preference modeling. Through baseline experiments on moderation, preference modeling, and Best-of-N refinements for prompt augmentation and diffusion fine-tuning, the paper demonstrates the dataset’s utility and exposes tensions between helpfulness and safety. The results establish SafeSora as a foundation for developing and validating alignment algorithms in text-to-video systems and outline practical paths for future research and system design improvements.

Abstract

To mitigate the risk of harmful outputs from large vision models (LVMs), we introduce the SafeSora dataset to promote research on aligning text-to-video generation with human values. This dataset encompasses human preferences in text-to-video generation tasks along two primary dimensions: helpfulness and harmlessness. To capture in-depth human preferences and facilitate structured reasoning by crowdworkers, we subdivide helpfulness into 4 sub-dimensions and harmlessness into 12 sub-categories, serving as the basis for pilot annotations. The SafeSora dataset includes 14,711 unique prompts, 57,333 unique videos generated by 4 distinct LVMs, and 51,691 pairs of preference annotations labeled by humans. We further demonstrate the utility of the SafeSora dataset through several applications, including training the text-video moderation model and aligning LVMs with human preference by fine-tuning a prompt augmentation module or the diffusion model. These applications highlight its potential as the foundation for text-to-video alignment research, such as human preference modeling and the development and validation of alignment algorithms.

Paper Structure (75 sections, 40 figures)

Figures (40)

  • Figure 1: Proportion of multi-label classifications for Prompt (Left) and T-V Pairs (Right).
  • Figure 2: Left - Video generation pipeline: Both the original and augmented prompts are used to generate multiple videos using five video generation models to form T-V pairs. Right - Two-stage annotation: The annotation process is structured into two distinct dimensions and two sequential stages. In the initial heuristic stage, crowdworkers are guided to annotate 4 sub-dimensions of helpfulness and 12 sub-categories of harmlessness. In the subsequent stage, they provide their decoupled preferences between two T-V pairs along the dimensions of helpfulness and harmlessness.
  • Figure 3: Linear correlation coefficient between labels of T-V pairs assigned by crowdworkers to 12 harm categories, identified as S1 through S12.
  • Figure 4: Linear correlation coefficient of different preference annotations.
  • Figure 5: Agreement between GPT-4o and crowdworkers on preferences and safety labels. Conservatively, the potential for general multi-modal LLMs to replace human annotators in preference labeling tasks remains limited.
  • ...and 35 more figures
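
The linear correlation coefficients mentioned in the captions of Figures 3 and 4 can be illustrated with a small sketch. The code below computes a Pearson correlation matrix over binary category labels, assuming each T-V pair is stored as a row of 0/1 flags. The function names (`pearson`, `correlation_matrix`) and the toy data are hypothetical and are not taken from the paper or the dataset; the paper's own computation may differ.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined when a label never varies.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(labels):
    """Pairwise Pearson correlations between label columns.

    labels: list of rows, one per T-V pair, each a list of 0/1 flags
    with one entry per category.
    """
    k = len(labels[0])
    columns = [[row[j] for row in labels] for j in range(k)]
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]


# Hypothetical toy data: 4 T-V pairs and 3 categories (not from the dataset).
toy_labels = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
matrix = correlation_matrix(toy_labels)
```

In this toy input the first two categories always co-occur, so their coefficient is 1, while the third category is perfectly anti-correlated with both. With real annotations the same function would be applied to a 12-column label table, one column per harm category S1 through S12.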