Autonomous Curriculum Design via Relative Entropy Based Task Modifications
Muhammed Yusuf Satici, Jianxun Wang, David L. Roberts
TL;DR
Curriculum learning can reduce training time but often requires manual design. The authors propose READ-C, an autonomous curriculum design framework that identifies high-uncertainty states using a relative-entropy measure $D_{KL}(P_{true}||P_{learnt})$ and uses start-state modifications to steer learning. They present two implementations: READ-C-TD with a teacher-based uncertainty calculation and READ-C-SA with a self-assessed regressor, and prove convergence under a two-time-scale RL framework. Empirical evaluation across Key-Lock, Capture-the-Flag, and Parking domains shows READ-C variants outperform random curricula and direct target-task learning, with READ-C-SA offering robust, teacher-free gains and heuristic variants further boosting performance. The work demonstrates a scalable, uncertainty-driven mechanism for automated curriculum design with practical gains in sample efficiency.
Abstract
Curriculum learning is a training method in which an agent is first trained on a curriculum of relatively simple tasks related to a target task in order to shorten the time required to train on the target task. Autonomous curriculum design involves designing such curricula without relying on human knowledge or expertise. Finding an efficient and effective way to design curricula autonomously remains an open problem. We propose a novel approach for automatically designing curricula by leveraging the learner's uncertainty to select curriculum tasks. Our approach measures the uncertainty in the learner's policy using relative entropy and guides the agent to states of high uncertainty to facilitate learning. Our algorithm supports the generation of autonomous curricula in a self-assessed manner by leveraging the learner's past and current policies, but it also allows teacher-guided design in an instructive setting. We provide theoretical guarantees for the convergence of our algorithm using two-time-scale optimization processes. Results show that our algorithm outperforms randomly generated curricula, learning directly on the target task, and existing curriculum-learning criteria from the literature. We also present two additional heuristic distance measures that can be combined with our relative-entropy approach for further performance improvements.
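
The core idea of scoring states by relative entropy and steering the agent toward the most uncertain ones can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes tabular policies given as per-state action distributions, a finite set of candidate start states, and a simple top-k selection rule. The names `kl_per_state` and `pick_start_states` are hypothetical. In a teacher-guided setting the reference distribution would come from the teacher; in a self-assessed setting it would have to be estimated from the learner's own past and current policies, which this sketch does not attempt.

```python
import numpy as np


def kl_per_state(p_true, p_learnt, eps=1e-12):
    """Row-wise D_KL(p_true || p_learnt).

    Both inputs have shape (n_states, n_actions); each row is an action
    distribution for one state. Tabular policies are an assumption of this
    sketch, not something stated in the abstract.
    """
    p = np.clip(np.asarray(p_true, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_learnt, dtype=float), eps, 1.0)
    # Renormalize after clipping so each row is still a distribution.
    p = p / p.sum(axis=1, keepdims=True)
    q = q / q.sum(axis=1, keepdims=True)
    return np.sum(p * np.log(p / q), axis=1)


def pick_start_states(p_true, p_learnt, k=1):
    """Return the k state indices with the largest divergence, plus all scores.

    Top-k selection is an illustrative choice; the paper may select or
    modify start states differently.
    """
    scores = kl_per_state(p_true, p_learnt)
    order = np.argsort(-scores, kind="stable")  # highest divergence first
    return order[:k].tolist(), scores


if __name__ == "__main__":
    # Hypothetical 3-state, 2-action example.
    reference = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
    learner = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]
    starts, scores = pick_start_states(reference, learner, k=1)
    print("candidate start states:", starts)
    print("per-state divergence:", scores)
```

In this toy example the learner matches the reference policy in the first two states and differs only in the third, so the third state receives the only nonzero score and is selected as the next start state.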
