Scaling Policy Gradient Quality-Diversity with Massive Parallelization via Behavioral Variations

Konstantinos Mitsides, Maxence Faldor, Antoine Cully

TL;DR

This paper introduces ASCII-ME, a scalable, policy-gradient-based Quality-Diversity (QD) algorithm that augments MAP-Elites without centralized actor-critic (AC) training. Its ASCII operator interpolates between action sequences according to time-step performance, then maps the resulting behavioral changes to genotype space through a Jacobian. The method is sample- and runtime-efficient, producing a diverse collection of high-performing deep neural network (DNN) policies on a single GPU in under 250 seconds. Across five Brax locomotion tasks, it demonstrates superior QD scores and faster runtimes, maintains performance as parallelization increases, and shows synergy between ASCII and Iso+LineDD while avoiding AC training bottlenecks. Overall, ASCII-ME provides a practical, scalable framework for evolving large neural networks with strong diversity and competitive performance, suitable for consumer-grade hardware and as a basis for future policy-gradient (PG) QD research that does not rely on AC training.
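
To make the "behavioral change mapped through a Jacobian" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the linear policy, the per-time-step weights, the step size, and every function name below are assumptions chosen for illustration. The sketch moves the parent's actions toward a target action sequence, weights each time step, and pulls that change back to the parameters with the transposed Jacobian.

```python
def policy_actions(theta, states):
    """Hypothetical linear policy with a scalar action: a_t = theta . s_t."""
    return [sum(w * x for w, x in zip(theta, s)) for s in states]


def jacobian(states):
    """For the linear policy above, d a_t / d theta_i is simply s_t[i]."""
    return [list(s) for s in states]


def ascii_like_variation(theta, states, target_actions, weights, step_size=0.1):
    """Illustrative only: weighted action interpolation mapped to parameters.

    weights[t] stands in for a per-time-step performance signal (assumption).
    """
    parent_actions = policy_actions(theta, states)
    # Behavioral variation: weighted move from the parent's actions to the target's.
    delta_a = [z * (a_tgt - a)
               for z, a_tgt, a in zip(weights, target_actions, parent_actions)]
    J = jacobian(states)
    # Map the behavioral change to genotype space: delta_theta = J^T delta_a.
    delta_theta = [sum(J[t][i] * delta_a[t] for t in range(len(states)))
                   for i in range(len(theta))]
    return [w + step_size * d for w, d in zip(theta, delta_theta)]


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    parent = [0.0, 0.0]
    child = ascii_like_variation(parent, states, [1.0, -1.0, 0.5], [1.0, 0.5, 0.0])
    print("parent actions:", policy_actions(parent, states))
    print("child actions: ", policy_actions(child, states))
```

Time steps with a zero weight contribute nothing to the update, so only the steps judged useful by the (assumed) performance signal shape the child. For a deep network the Jacobian would not be formed explicitly; an automatic-differentiation vector-Jacobian product would play the same role.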

Abstract

Quality-Diversity optimization comprises a family of evolutionary algorithms aimed at generating a collection of diverse and high-performing solutions. MAP-Elites (ME), a notable example, is used effectively in fields like evolutionary robotics. However, the reliance of ME on random mutations from Genetic Algorithms limits its ability to evolve high-dimensional solutions. Methods proposed to overcome this include using gradient-based operators like policy gradients or natural evolution strategies. While successful at scaling ME for neuroevolution, these methods often suffer from slow training speeds or difficulties in scaling with massive parallelization due to high computational demands or reliance on centralized actor-critic training. In this work, we introduce a fast, sample-efficient ME-based algorithm capable of scaling up with massive parallelization, significantly reducing runtimes without compromising performance. Our method, ASCII-ME, unlike existing policy gradient quality-diversity methods, does not rely on centralized actor-critic training. It performs behavioral variations based on time-step performance metrics and maps these variations to solutions using policy gradients. Our experiments show that ASCII-ME can generate a diverse collection of high-performing deep neural network policies in less than 250 seconds on a single GPU. Additionally, it operates, on average, five times faster than state-of-the-art algorithms while still maintaining competitive sample efficiency.
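
For readers unfamiliar with MAP-Elites, the loop the abstract builds on can be sketched as follows. This is a toy version under stated assumptions: the fitness function, the two-dimensional descriptor, the grid size, and the Gaussian mutation are all hypothetical stand-ins, not the Brax tasks or operators used in the paper. The QD score is computed in its common form, as the sum of the fitnesses of all elites in the archive.

```python
import random


def evaluate(x):
    """Toy task (assumption): fitness peaks at x = 0.5 in every dimension;
    the descriptor is the first two coordinates clipped to [0, 1]."""
    fitness = 1.0 / (1.0 + sum((v - 0.5) ** 2 for v in x))
    descriptor = tuple(min(max(v, 0.0), 1.0) for v in x[:2])
    return fitness, descriptor


def cell_index(descriptor, bins):
    """Map a descriptor in [0, 1]^2 to a grid cell."""
    return tuple(min(int(d * bins), bins - 1) for d in descriptor)


def try_insert(archive, x, bins):
    """Keep x only if its cell is empty or x beats the current elite."""
    fitness, descriptor = evaluate(x)
    key = cell_index(descriptor, bins)
    if key not in archive or fitness > archive[key][0]:
        archive[key] = (fitness, x)
        return True
    return False


def map_elites(iterations=2000, dim=4, bins=5, sigma=0.1, seed=0):
    rng = random.Random(seed)
    archive = {}
    # Random initialization of the archive.
    for _ in range(20):
        try_insert(archive, [rng.uniform(0.0, 1.0) for _ in range(dim)], bins)
    # Main loop: select an elite, mutate it, attempt insertion.
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        child = [v + rng.gauss(0.0, sigma) for v in parent]
        try_insert(archive, child, bins)
    return archive


def qd_score(archive):
    """Sum of elite fitnesses (fitness is positive in this toy task)."""
    return sum(fitness for fitness, _ in archive.values())


if __name__ == "__main__":
    result = map_elites()
    print("filled cells:", len(result), "QD score:", round(qd_score(result), 3))
```

The random Gaussian mutation in the main loop is the part that gradient-informed operators replace or complement when the genotype is a large neural network.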

Paper Structure

This paper contains 38 sections, 15 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: The ASCII-ME method employs two distinct variation operators within the standard MAP-Elites loop: (1) Iso+LineDD, which mutates a parent genotype based on that of a randomly selected elite; (2) ASCII, which interpolates between the parent's behavior and another behavior sampled from the buffer, using performance metrics encapsulated in $\bm{Z}$. The behavioral changes are then mapped to the genotypic space by $\bm{J}$ to mutate the parent genotype.
  • Figure 2: Main metrics for ASCII-ME and baselines across tasks, with each algorithm running for the duration required for the slowest one to complete one million evaluations. The solid line represents the median, and the shaded area indicates the lower and upper quartiles across 20 seeds. Checkpoints show the number of evaluations completed by each algorithm at that point.
  • Figure 3: QD score (left) and runtime (right) for ASCII-ME and all baselines with varying batch sizes across tasks, after one million evaluations. Vertical lines on bars show lower and upper quartiles; bar height indicates the median over 20 seeds.
  • Figure 4: Accumulated number of solutions added to the archive by the Iso+LineDD variation operator and by the PG variation operator plus Actor Injection (AI). The solid line is the median and the shaded area represents lower and upper quartiles over 20 seeds.
  • Figure 5: QD score of ASCII-ME with different mutation proportions, after one million evaluations. Vertical lines on bars show lower and upper quartiles; bar height indicates the median over 20 seeds.
  • ...and 2 more figures
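
The Iso+LineDD operator named in the Figure 1 caption can be sketched as below. The formula is the standard one for this operator: isotropic Gaussian noise on the parent plus a single Gaussian-scaled step along the line from the parent to another elite. The function name, the list-based genotype, and the default sigma values are illustrative assumptions, not settings taken from the paper.

```python
import random


def iso_line_dd(parent, elite, iso_sigma=0.005, line_sigma=0.05, rng=random):
    """Iso+LineDD variation:
    child = parent + iso_sigma * N(0, I) + line_sigma * N(0, 1) * (elite - parent)
    """
    # One scalar coefficient shared by all dimensions: the directional term.
    line_coeff = rng.gauss(0.0, line_sigma)
    return [
        p + rng.gauss(0.0, iso_sigma) + line_coeff * (e - p)
        for p, e in zip(parent, elite)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    parent = [0.0, 0.0, 0.0]
    elite = [1.0, 2.0, 4.0]
    print(iso_line_dd(parent, elite, rng=rng))
```

Because the directional coefficient is shared across dimensions, setting `iso_sigma` to zero places the child exactly on the line through the parent and the selected elite; the isotropic term then adds local exploration around that line.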