Exploiting Contextual Structure to Generate Useful Auxiliary Tasks
Benedict Quartey, Ankit Shah, George Konidaris
TL;DR
The paper tackles the inefficiency of reinforcement learning in robotics by maximizing experience reuse through autonomously generated, temporally extended auxiliary tasks. It introduces TaskExplore, which constructs abstract linear temporal logic (LTL) task templates and uses context-aware object embeddings from large language models to create auxiliary tasks via object swaps. These auxiliary tasks are learned alongside the given target task with counterfactual, off-policy updates. Experiments in a home-like grid domain show that the auxiliary tasks share the target task's exploration requirements, which improves directed exploration and learning efficiency without increasing environmental interactions. This contributes to lifelong learning by enabling automatic policy generation and reuse; future work aims to relax object propositional constraints using vision-language models.
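To make the object-swap idea concrete, here is a minimal sketch under stated assumptions. The embedding vectors, the object names, the template string, and the helper names (`cosine`, `nearest_objects`, `auxiliary_tasks`) are all hypothetical; in the paper the embeddings come from a large language model, and the actual template construction may differ from this illustration.

```python
import math

# Hypothetical toy embeddings (assumption): stand-ins for the
# context-aware object embeddings produced by a large language model.
EMBEDDINGS = {
    "cup": [0.9, 0.1, 0.0],
    "mug": [0.85, 0.15, 0.05],
    "plate": [0.6, 0.4, 0.1],
    "sofa": [0.0, 0.2, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_objects(target, k=2):
    """Return the k objects most similar to `target`, most similar first."""
    scores = [
        (cosine(EMBEDDINGS[target], vec), name)
        for name, vec in EMBEDDINGS.items()
        if name != target
    ]
    return [name for _, name in sorted(scores, reverse=True)[:k]]


def auxiliary_tasks(template, slot_object, k=2):
    """Fill the template's object slot with objects similar to `slot_object`."""
    return [template.format(obj=o) for o in nearest_objects(slot_object, k)]


# Hypothetical abstract template: eventually reach {obj}, then eventually the sink.
TEMPLATE = "F ({obj} & F sink)"
tasks = auxiliary_tasks(TEMPLATE, "cup")
print(tasks)
```

With these toy vectors, the objects closest to "cup" are "mug" and "plate", so the two generated formulas differ from the target task only in the swapped object.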
Abstract
Reinforcement learning requires interaction with an environment, which is expensive for robots. This constraint necessitates approaches that work with limited environmental interaction by maximizing the reuse of previous experiences. We propose an approach that maximizes experience reuse while learning to solve a given task by generating and simultaneously learning useful auxiliary tasks. To generate these tasks, we construct an abstract temporal logic representation of the given task and leverage large language models to generate context-aware object embeddings that facilitate object replacements. Counterfactual reasoning and off-policy methods allow us to simultaneously learn these auxiliary tasks while solving the given target task. We combine these insights into a novel framework for multitask reinforcement learning and experimentally show that our generated auxiliary tasks share underlying exploration requirements similar to those of the given task, thereby maximizing the utility of directed exploration. Our approach allows agents to automatically learn additional useful policies without extra environment interaction.
