Quality-Diversity through AI Feedback

Herbie Bradley; Andrew Dai; Hannah Teufel; Jenny Zhang; Koen Oostermeijer; Marco Bellagente; Jeff Clune; Kenneth Stanley; Grégory Schott; Joel Lehman

TL;DR

This work introduces Quality-Diversity through AI Feedback (QDAIF), a method that blends MAP-Elites with language-model–driven generation, evaluation, and refinement to explore diverse, high-quality text outputs in creative domains. By using LMs as both mutation operators (LMX) and evaluators of quality and diversity, QDAIF eliminates the need for handcrafted domain-specific metrics and scales with advances in foundation models. Across opinions, short stories, and poetry, QDAIF achieves higher QD scores and demonstrates alignment between AI and human judgments, while analyses reveal areas for improvement such as reward hacking and the calibration challenges of AI feedback. The approach suggests a pathway toward autonomous, open-ended search systems capable of generating, evaluating, and improving creative content across multiple modalities and domains.

Abstract

In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through AI feedback, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.
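The loop described in the abstract can be sketched as a MAP-Elites-style search. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `qdaif_sketch`, `toy_mutate`, `toy_quality`, and `toy_diversity` are hypothetical, and the three toy functions are simple stand-ins for the LM calls that QDAIF would use for mutation, quality feedback, and diversity feedback.

```python
import random
from typing import Callable, Dict, List, Tuple

Elite = Tuple[str, float]  # (text, quality score)


def qdaif_sketch(
    seed_texts: List[str],
    mutate: Callable[[str, random.Random], str],
    quality: Callable[[str], float],
    diversity: Callable[[str], float],
    n_bins: int = 20,
    iterations: int = 200,
    seed: int = 0,
) -> Dict[int, Elite]:
    """MAP-Elites-style loop: one elite per diversity bin (1D archive)."""
    if not seed_texts:
        raise ValueError("at least one seed text is required")
    rng = random.Random(seed)
    archive: Dict[int, Elite] = {}

    def try_insert(text: str) -> None:
        q = quality(text)
        d = min(max(diversity(text), 0.0), 1.0)  # clamp to [0, 1]
        bin_index = min(int(d * n_bins), n_bins - 1)
        # Keep the candidate only if its bin is empty or it beats the elite.
        if bin_index not in archive or q > archive[bin_index][1]:
            archive[bin_index] = (text, q)

    for text in seed_texts:
        try_insert(text)
    for _ in range(iterations):
        parent, _parent_quality = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


# Toy stand-ins (assumptions): a real system would prompt an LM here.
def toy_mutate(text: str, rng: random.Random) -> str:
    words = text.split()
    words[rng.randrange(len(words))] = rng.choice(
        ["happy", "sad", "spy", "politician"]
    )
    return " ".join(words)


def toy_quality(text: str) -> float:
    # Pretend that relevance means mentioning a spy.
    return 1.0 if "spy" in text.split() else 0.5


def toy_diversity(text: str) -> float:
    # Pretend that the diversity axis is the share of "happy" vs "sad" words.
    words = text.split()
    happy, sad = words.count("happy"), words.count("sad")
    return 0.5 if happy + sad == 0 else happy / (happy + sad)


archive = qdaif_sketch(
    ["the spy met a sad politician"],
    toy_mutate, toy_quality, toy_diversity,
    n_bins=20, iterations=300,
)
```

Swapping the toy functions for prompted LM calls leaves the archive logic unchanged, which is the sense in which the evaluator and the mutation operator are pluggable in this sketch.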

Paper Structure (66 sections, 41 figures, 60 tables)

Introduction
Background & Related Work
Evolution through Large Models
Quality Diversity Algorithms
AI Feedback
Approach
Experiments on Creative Writing Domain
Setup: Opinion Writing, Short Stories
Comparisons between QDAIF and Baselines
Extensions to AI Feedback and Mutation Model
Evolving Solutions through Instruction Guidance
Discussion and Conclusion
Appendix
Human Study on Quality-Diversity of Text Samples
Comparison of quality scores
...and 51 more sections

Figures (41)

Figure 1: QDAIF (left) covers more of the search space with diverse, high-quality stories than the baseline (right). The baseline is LMX, Quality-Only (Meyerson et al., 2023), which optimizes only for the quality of solutions. QDAIF discovered more interesting stories about a spy and a politician, ranging from romance stories with a happy ending to horror stories with a tragic ending. The baseline produced a story (right-middle position, starting with "Jason") with a lower quality score because it lacks the desired spy character (denoted by the red-colored bin, for a story with a neutral ending that leans toward horror). QDAIF discovered a better, more relevant story (bottom-middle position, starting with "a wealthy politician") for this same neutral bin.
Figure 2: Overview of Quality-Diversity through AI Feedback (QDAIF). Dark components indicate where language models (LMs) are employed. QDAIF randomly selects a solution from the QD archive. The chosen solution (the parent) forms part of a prompt that is fed into an LM, which applies LMX mutation to produce a new solution. An LM then evaluates the quality and diversity attributes of the new solution. The newly evaluated solution is compared with the existing solutions in the QD archive, and the archive is updated accordingly.
Figure 3: QDAIF significantly outperforms baselines in QD score in all domains. Performance statistics are shown as the mean with bootstrapped 95% CI across 5 random-seed runs. The maximum possible QD score is 20 (100 for the 2D archive in the 4th plot). See the appendix (app:coverage_best_solution_discussion) for additional statistics.
Figure 4: QDAIF (LMX-guided) (left) covers the space of poetry with high-quality solutions (on a rating scale), with poems matching the closest bins. QDAIF solutions take qualitative inspiration from the seed poem's imagery of "fields of green waves" (see the appendix, app:poetry_setup) while giving meaningfully diverse kinds of poems across the search space. QDAIF (LMX-rewrite) (not shown) also covers more of the space of diverse, high-quality poems than Random-Poems (right).
Figure 5: Correlation plot between the quality rating from human annotators and the fitness range (quality computed from AI feedback). The mean human-annotated quality and statistical error for different ranges of AI feedback fitness scores indicate more frequent instances of reward hacking (Skalse et al., 2022; Lehman et al., 2019) in the outputs of some search methods evaluated in this study.
...and 36 more figures
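The maximum QD scores quoted in the Figure 3 caption (20, or 100 for the 2D archive) can be reproduced with a short calculation. The sketch below assumes the commonly used definition of QD score as the sum of the best quality value in each filled bin, with quality in [0, 1]; the 10 x 10 layout of the 2D archive is an assumption inferred from the stated maximum of 100, and the function names `qd_score` and `coverage` are illustrative.

```python
from typing import Dict, Hashable


def qd_score(archive: Dict[Hashable, float]) -> float:
    """Sum of the elite quality in every filled bin; empty bins add 0."""
    return sum(archive.values())


def coverage(archive: Dict[Hashable, float], n_bins: int) -> float:
    """Fraction of bins that hold at least one solution."""
    return len(archive) / n_bins


# A fully covered 1D archive with 20 bins and perfect quality everywhere.
full_1d = {b: 1.0 for b in range(20)}

# A fully covered 2D archive, assumed here to be 10 x 10 bins.
full_2d = {(i, j): 1.0 for i in range(10) for j in range(10)}

# A sparse archive: only two of the 20 bins are filled.
sparse_1d = {0: 0.5, 3: 0.25}

print(qd_score(full_1d))        # 20.0
print(qd_score(full_2d))        # 100.0
print(qd_score(sparse_1d))      # 0.75
print(coverage(sparse_1d, 20))  # 0.1
```

Under this definition a method can only raise its QD score by filling new bins or by improving the elite in a bin it already covers, so the metric rewards diversity and quality jointly.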
