MAP-Elites with Transverse Assessment for Multimodal Problems in Creative Domains

Marvin Zammit, Antonios Liapis, Georgios N. Yannakakis

TL;DR

This work tackles the evaluation and orchestration of multimodal creative artefacts by extending MAP-Elites with Transverse Assessment (MEliTA) to optimize cross-modal coherence across text and image generations. MEliTA decouples modalities, uses a coherence-driven N-dimensional archive, and leverages partial artefact sharing across elites to improve text-to-image mappings measured with CLIP-based fitness. Empirical results on generating fictional game titles and covers show MEliTA yields fitter, more coherent artefacts, with some trade-offs in archive coverage and diversity, suggesting potential for broader multimodal coordination. The approach advances bottom-up multimodal creative systems and lays groundwork for integrating more modalities and open-source generators in future studies.

Abstract

The recent advances in language-based generative models have paved the way for the orchestration of multiple generators of different artefact types (text, image, audio, etc.) into one system. Presently, many open-source pre-trained models combine text with other modalities, thus enabling shared vector embeddings to be compared across different generators. Within this context we propose a novel approach to handle multimodal creative tasks using Quality Diversity evolution. Our contribution is a variation of the MAP-Elites algorithm, MAP-Elites with Transverse Assessment (MEliTA), which is tailored for multimodal creative tasks and leverages deep learned models that assess coherence across modalities. MEliTA decouples the artefacts' modalities and promotes cross-pollination between elites. As a test bed for this algorithm, we generate text descriptions and cover images for a hypothetical video game and assign each artefact a unique modality-specific behavioural characteristic. Results indicate that MEliTA can improve text-to-image mappings within the solution space, compared to a baseline MAP-Elites algorithm that strictly treats each image-text pair as one solution. Our approach represents a significant step forward in multimodal bottom-up orchestration and lays the groundwork for more complex systems coordinating multimodal creative agents in the future.

Paper Structure

This paper contains 22 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Image variation sample: The parent's (unchanged) text modality is used as a prompt for image repair based on Stable Diffusion (SD), alongside "standard" negative prompts.
  • Figure 2: The MEliTA process in a simplified feature map for this use case, with grey cells occupied by elites. From one selected elite $E$, the changed image ($e'_V$) produces three candidate solutions from elites $E$, $R_1$, $R_2$. Based on their CLIP score, the ordered list of candidates is $\boldsymbol{L}=\{R'_2,E',R'_1\}$. Since $q(R'_2)>q(R_2)$, the candidate $R'_2$ (which merges the image from $E'$ and the text from $R_2$) replaces $R_2$. If $q(R'_2) \leq q(R_2)$, then $E'$ would occupy the empty cell at (5,0). Dotted lines denote temporary individuals that are lost after this parent selection.
  • Figure 3: Metrics of the archives after 2000 selections in MAP-Elites and MEliTA. Box plots summarise values from 10 runs per title.
  • Figure 4: Area under curve (AUC) of QD metrics over 2000 selections in MAP-Elites and MEliTA. Box plots summarise values from 10 runs per title.
  • Figure 5: Visual and textual distance metrics (mean and nearest-neighbour) among final elites of MEliTA and MAP-Elites without Transverse Assessment. Box plots summarise values from 10 runs per title.
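
The replacement rule described in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `Solution`, `transverse_assessment`, and `coherence` are hypothetical; the archive is assumed to be a plain dictionary keyed by (visual bin, textual bin); and `coherence` stands in for the CLIP-based score $q$. The sketch further assumes that candidates are tried in descending order of fitness and that the process stops at the first successful insertion, which is how the caption's example reads.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Assumption: a cell is indexed as (visual bin, textual bin).
Cell = Tuple[int, int]


@dataclass(frozen=True)
class Solution:
    image: str      # placeholder for the image modality
    text: str       # placeholder for the text modality
    cell: Cell      # position in the feature map
    fitness: float  # coherence score q (CLIP-based in the paper)


def transverse_assessment(
    archive: Dict[Cell, Solution],
    parent: Solution,
    new_image: str,
    visual_bin: int,
    coherence: Callable[[str, str], float],
) -> Optional[Solution]:
    """Pair a changed image with the parent's text and with the text of every
    elite that shares the image's visual bin, then try to insert the
    candidates in descending order of fitness (hypothetical sketch)."""
    partners = [parent] + [
        elite
        for (v_bin, _), elite in archive.items()
        if v_bin == visual_bin and elite is not parent
    ]
    candidates = [
        Solution(
            image=new_image,
            text=p.text,
            cell=(visual_bin, p.cell[1]),  # keeps the text donor's textual bin
            fitness=coherence(new_image, p.text),
        )
        for p in partners
    ]
    candidates.sort(key=lambda c: c.fitness, reverse=True)

    for cand in candidates:
        occupant = archive.get(cand.cell)
        if occupant is None or cand.fitness > occupant.fitness:
            archive[cand.cell] = cand
            return cand  # remaining candidates are discarded
    return None  # no candidate improved on, or found, a free cell


# Toy scenario mirroring the caption; all scores are made up for illustration.
scores = {("iNew", "t2"): 0.9, ("iNew", "tE"): 0.7, ("iNew", "t1"): 0.3}
E = Solution("iE", "tE", (2, 0), 0.5)
R1 = Solution("i1", "t1", (5, 1), 0.6)
R2 = Solution("i2", "t2", (5, 2), 0.4)
archive = {E.cell: E, R1.cell: R1, R2.cell: R2}

placed = transverse_assessment(
    archive, E, "iNew", 5, lambda img, txt: scores[(img, txt)]
)
print(placed)
```

In this toy run the candidate built from the text of `R2` scores highest and beats the current occupant of its cell, so it replaces `R2`, and the cell at (5, 0) stays empty. If `R2` had a higher fitness than that candidate, the next candidate in the list (the changed image paired with the parent's own text) would be tried and would occupy the empty cell at (5, 0).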