Few-shot Quality-Diversity Optimization
Achkan Salehi, Alexandre Coninx, Stephane Doncieux
TL;DR
The paper addresses the data inefficiency of Quality-Diversity (QD) optimization in deceptive or sparse-reward environments by introducing FAERY, a gradient-free meta-learning framework that learns a prior population $\mathcal{P}$ from a task distribution $\mathcal{T}$. FAERY optimizes two meta-objectives, $f_0$ (polyvalence) and $f_1$ (adaptation speed), computed from the evolution paths of multiple QD runs, and updates $\mathcal{P}$ via Pareto optimization to enable rapid adaptation to unseen tasks $t_{new}$. The approach is model-agnostic and requires no gradients, preserving the flexibility of QD while substantially reducing the number of generations needed to reach solutions in both sparse and dense reward settings. Experiments on randomly generated mazes and Meta-World manipulation tasks show large reductions in adaptation time and improved transfer across tasks, highlighting FAERY’s potential for continual and multi-task learning in robotics.
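The Pareto update over the two meta-objectives can be illustrated with a minimal sketch. This is not the authors' implementation: the concrete definitions used here (polyvalence as the number of tasks solved by an individual's descendants, adaptation speed as the negative mean number of generations those descendants needed) are assumptions made for illustration, and the names `meta_objectives`, `dominates`, `pareto_front`, and the `paths` layout are hypothetical.

```python
from typing import Dict, List, Tuple

Score = Tuple[float, float]


def meta_objectives(paths: Dict[str, Dict[str, int]]) -> Dict[str, Score]:
    """Compute (f0, f1) for each individual of the prior population.

    ASSUMPTION: paths[individual][task] holds the number of generations a
    descendant of `individual` needed to solve `task`; unsolved tasks are absent.
    Both objectives are maximized.
    """
    scores: Dict[str, Score] = {}
    for individual, solved in paths.items():
        # f0 (polyvalence): how many tasks this individual led to a solution for.
        f0 = float(len(solved))
        # f1 (adaptation speed): negative mean generations, so faster is larger.
        f1 = -sum(solved.values()) / len(solved) if solved else float("-inf")
        scores[individual] = (f0, f1)
    return scores


def dominates(a: Score, b: Score) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Score]) -> List[str]:
    """Return the individuals that no other individual dominates."""
    return sorted(
        individual
        for individual, score in scores.items()
        if not any(
            dominates(other_score, score)
            for other, other_score in scores.items()
            if other != individual
        )
    )


# Hypothetical evolution-path statistics gathered from QD runs on two tasks.
paths = {
    "a": {"t1": 3, "t2": 5},  # solves both tasks, moderately fast
    "b": {"t1": 1},           # solves one task, very fast
    "c": {"t2": 6},           # solves one task, slowly
    "d": {},                  # led to no solution
}
front = pareto_front(meta_objectives(paths))
print(front)
```

In this toy example, `a` and `b` trade off polyvalence against adaptation speed, so both survive, while `c` and `d` are dominated by `a`. A full meta-learning loop would presumably refill the population from the surviving individuals before the next round of QD runs; that step is omitted here.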
Abstract
In the past few years, a considerable amount of research has been dedicated to the exploitation of previous learning experiences and the design of Few-shot and Meta Learning approaches, in problem domains ranging from Computer Vision to Reinforcement Learning based control. A notable exception where, to the best of our knowledge, little to no effort has been made in this direction is Quality-Diversity (QD) optimization. QD methods have been shown to be effective tools for dealing with deceptive minima and sparse rewards in Reinforcement Learning. However, they remain costly due to their reliance on inherently sample-inefficient evolutionary processes. We show that, given examples from a task distribution, information about the paths taken by optimization in parameter space can be leveraged to build a prior population which, when used to initialize QD methods in unseen environments, allows for few-shot adaptation. Our proposed method does not require backpropagation. It is simple to implement and scale, and it is agnostic to the underlying models that are being trained. Experiments carried out in both sparse and dense reward settings using robotic manipulation and navigation benchmarks show that it considerably reduces the number of generations required for QD optimization in these environments.
