MAP-Elites with Transverse Assessment for Multimodal Problems in Creative Domains

Marvin Zammit; Antonios Liapis; Georgios N. Yannakakis

TL;DR

This work tackles the evaluation and orchestration of multimodal creative artefacts by introducing MAP-Elites with Transverse Assessment (MEliTA), a variation of MAP-Elites that optimises cross-modal coherence between generated text and images. MEliTA decouples the modalities, uses a coherence-driven N-dimensional archive, and shares partial artefacts across elites to improve text-to-image mappings, with fitness measured via CLIP. Empirical results on generating fictional game titles and covers show that MEliTA yields fitter, more coherent artefacts, with some trade-offs in archive coverage and diversity, suggesting potential for broader multimodal coordination. The approach advances bottom-up multimodal creative systems and lays groundwork for integrating more modalities and open-source generators in future studies.

Abstract

The recent advances in language-based generative models have paved the way for the orchestration of multiple generators of different artefact types (text, image, audio, etc.) into one system. Presently, many open-source pre-trained models combine text with other modalities, thus enabling shared vector embeddings to be compared across different generators. Within this context we propose a novel approach to handle multimodal creative tasks using Quality Diversity evolution. Our contribution is a variation of the MAP-Elites algorithm, MAP-Elites with Transverse Assessment (MEliTA), which is tailored for multimodal creative tasks and leverages deep learned models that assess coherence across modalities. MEliTA decouples the artefacts' modalities and promotes cross-pollination between elites. As a test bed for this algorithm, we generate text descriptions and cover images for a hypothetical video game and assign each artefact a unique modality-specific behavioural characteristic. Results indicate that MEliTA can improve text-to-image mappings within the solution space, compared to a baseline MAP-Elites algorithm that strictly treats each image-text pair as one solution. Our approach represents a significant step forward in multimodal bottom-up orchestration and lays the groundwork for more complex systems coordinating multimodal creative agents in the future.

Paper Structure (22 sections, 5 figures, 2 tables)

Introduction
MAP-Elites with Transverse Assessment
Use Case: Generating Text & Visuals for Game Titles
Text Modality
Text Generation.
Text Mutation.
Text Characterisation.
Image Modality
Image Generation.
Image Mutation.
Image Characterisation.
MEliTA applied to the use case
Experimental Protocol
Performance Metrics.
Test Cases.
...and 7 more sections

Figures (5)

Figure 1: Image variation sample: The parent's (unchanged) text modality is used as a prompt for image repair based on SD, alongside "standard" negative prompts.
Figure 2: The MEliTA process in a simplified feature map for this use case, with grey cells occupied by elites. From one selected elite $E$, the changed image ($e'_V$) produces three candidate solutions from elites $E$, $R_1$, $R_2$. Based on their CLIP score, the ordered list of candidates is $\boldsymbol{L}=\{R'_2, E', R'_1\}$. Since $q(R'_2) > q(R_2)$, the candidate $R'_2$ (that merges the image from $E'$ and text from $R_2$) replaces $R_2$. If $q(R'_2) \leq q(R_2)$ then $E'$ would occupy the empty cell at (5,0). Dotted lines denote temporary individuals that are lost after this parent selection.
Figure 3: Metrics of the archives after 2000 selections in MAP-Elites and MEliTA. Box plots summarise values from 10 runs per title.
Figure 4: Area under curve (AUC) of QD metrics over 2000 selections in MAP-Elites and MEliTA. Box plots summarise values from 10 runs per title.
Figure 5: Visual and textual distance metrics (mean and nearest-neighbour) among final elites of MEliTA and MAP-Elites without Transverse Assessment. Box plots summarise values from 10 runs per title.
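
The transverse assessment step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Elite`, `transverse_assessment` and `fitness_fn` are hypothetical, cells are assumed to be indexed as (image bin, text bin), images are stood in for by strings, and candidates are assumed to be tried in descending order of fitness until one fills an empty cell or beats the incumbent elite.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Cell = Tuple[int, int]  # assumed layout: (image BC bin, text BC bin)


@dataclass(frozen=True)
class Elite:
    text: str
    image: str      # stand-in for image data in this sketch
    fitness: float  # e.g. a CLIP-based coherence score


def transverse_assessment(
    archive: Dict[Cell, Elite],
    parent_cell: Cell,
    new_image: str,
    new_image_bin: int,
    fitness_fn: Callable[[str, str], float],
) -> Optional[Cell]:
    """Pair a mutated image with texts of compatible elites; place at most one candidate."""
    parent = archive[parent_cell]

    # The parent's own text is always a source (candidate E' in the caption).
    sources = [(parent_cell[1], parent.text)]
    # So is the text of every elite already in the new image's bin (R_1, R_2).
    for (img_bin, txt_bin), elite in archive.items():
        if img_bin == new_image_bin and (img_bin, txt_bin) != parent_cell:
            sources.append((txt_bin, elite.text))

    # Build temporary candidates and score each text-image pairing.
    candidates = [
        ((new_image_bin, txt_bin), Elite(text, new_image, fitness_fn(text, new_image)))
        for txt_bin, text in sources
    ]
    # Assumption: the ordered list L is sorted from fittest to least fit.
    candidates.sort(key=lambda item: item[1].fitness, reverse=True)

    for cell, candidate in candidates:
        incumbent = archive.get(cell)
        if incumbent is None or candidate.fitness > incumbent.fitness:
            archive[cell] = candidate
            return cell  # remaining candidates are discarded
    return None  # no candidate improved the archive
```

In this sketch only one candidate per parent selection can enter the archive, which mirrors the caption's remark that the other temporary individuals are lost.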
