Exploiting Contextual Structure to Generate Useful Auxiliary Tasks

Benedict Quartey; Ankit Shah; George Konidaris

TL;DR

The paper tackles the inefficiency of reinforcement learning in robotics by maximizing experience reuse through autonomously generated, temporally extended auxiliary tasks. It introduces TaskExplore, which constructs abstract LTL task templates and uses context-aware object embeddings from large language models to create auxiliary tasks via object swaps, all learned alongside a given target task with counterfactual, off-policy updates. The approach demonstrates that these auxiliary tasks share the target task's exploration requirements, improving directed exploration and learning efficiency in a home-like grid domain, without increasing environmental interactions. This contributes to lifelong learning by enabling automatic policy generation and reuse, with future work aimed at relaxing object propositional constraints using vision-language models.

Abstract

Reinforcement learning requires interaction with an environment, which is expensive for robots. This constraint necessitates approaches that work with limited environmental interaction by maximizing the reuse of previous experiences. We propose an approach that maximizes experience reuse while learning to solve a given task by generating and simultaneously learning useful auxiliary tasks. To generate these tasks, we construct an abstract temporal logic representation of the given task and leverage large language models to generate context-aware object embeddings that facilitate object replacements. Counterfactual reasoning and off-policy methods allow us to simultaneously learn these auxiliary tasks while solving the given target task. We combine these insights into a novel framework for multitask reinforcement learning and experimentally show that our generated auxiliary tasks share underlying exploration requirements similar to those of the given task, thereby maximizing the utility of directed exploration. Our approach allows agents to automatically learn additional useful policies without extra environment interaction.
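The idea of learning auxiliary tasks from the target task's experience can be illustrated with a minimal sketch. It is not the authors' implementation: the corridor environment, the `step` and `train` functions, and the reduction of each LTL task to "reach one goal cell" are all assumptions made for illustration. What it does show is the mechanism the abstract describes: actions are chosen epsilon-greedily on the target task only, and every transition is relabeled with each task's own reward so that all Q-functions receive an off-policy Q-learning update.

```python
import random

# Hypothetical 1-D corridor with 7 cells. Each task is "reach its goal cell",
# a deliberately simplified stand-in for the paper's LTL task specifications.
N_CELLS = 7
ACTIONS = (-1, +1)  # move left / move right


def step(state, action):
    """Deterministic transition, clipped to the corridor bounds."""
    return min(max(state + action, 0), N_CELLS - 1)


def train(target_goal, aux_goals, episodes=300, alpha=0.5, gamma=0.9,
          epsilon=0.2, horizon=20, seed=0):
    rng = random.Random(seed)
    goals = [target_goal] + list(aux_goals)
    # One tabular Q-function per task: q[task][state][action_index].
    q = [[[0.0, 0.0] for _ in range(N_CELLS)] for _ in goals]
    env_steps = 0
    for _ in range(episodes):
        state = N_CELLS // 2
        for _ in range(horizon):
            # Behavior policy: epsilon-greedy on the target task (index 0),
            # with random tie-breaking between equally valued actions.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(q[0][state])
                a = rng.choice([i for i in range(2) if q[0][state][i] == best])
            nxt = step(state, ACTIONS[a])
            env_steps += 1
            # Counterfactual updates: reuse the same transition for every
            # task by recomputing the reward that task would have given.
            for k, goal in enumerate(goals):
                done = nxt == goal
                reward = 1.0 if done else 0.0
                bootstrap = 0.0 if done else gamma * max(q[k][nxt])
                q[k][state][a] += alpha * (reward + bootstrap - q[k][state][a])
            if nxt == target_goal:
                break
            state = nxt
    return q, env_steps


# Usage: one target task (goal cell 6) and one auxiliary task (goal cell 5).
q_tables, steps_used = train(target_goal=6, aux_goals=[5])
```

The environment step counter depends only on the behavior policy, so adding auxiliary goals changes the number of Q-tables updated per step but not the number of environment interactions.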

Paper Structure (12 sections, 6 equations, 6 figures)

This paper contains 12 sections, 6 equations, 6 figures.

Introduction
Background
Linear Temporal Logic
Off-policy Learning with LTL
Related Work
Problem Definition
Exploiting contextual structure to generate auxiliary tasks
Exploiting Structure in Object Relationships
Exploiting Structure in Task Composition
Off-policy Updates via Counterfactual Experience
Results and Discussion
Conclusion

Figures (6)

Figure 1: HomeGrid, a deterministic discrete grid-world domain. Agents in this world complete tasks by visiting grid locations corresponding to objects in the environment. Tasks are specified with LTL formulae that represent the sequence/ordering of subgoals necessary for completing a given task. Satisfying these tasks involves visiting relevant grid cells in an acceptable order determined by the task specification. As an example, the numbered arrows in the diagram indicate a policy for satisfying the Goto Fridge task.
Figure 2: This figure depicts the TaskExplore framework. Given a task specified in linear temporal logic, we construct an abstract task template that replaces instance object propositions in the given formula with large language model embeddings of their descriptions, capturing various relevant attributes of each object. We then generate auxiliary tasks by selecting objects from the environment for each proposition node in our abstract task template using the cosine similarity metric. We initialize policies (Q-value functions) for the given task and all auxiliary tasks and perform RL where actions are selected $\epsilon$-greedily with respect to the given task only, gathering directed experiences necessary for solving the given task. At each learning step, all Q-value functions are updated via off-policy Q-learning updates.
Figure 3: This figure depicts how TaskExplore constructs and leverages context-aware object embeddings and abstract task representations/templates. In panel (a), we use an autoregressive LLM to generate detailed descriptions for the list of objects in our environment and use an encoder language model to encode these generated descriptions into a 768-dimensional vector for each object. We then cluster these description embeddings, discovering object classes that capture the semantic and contextual similarity between objects. In panel (b), our approach constructs a task template by representing proposition nodes in the abstract syntax graph of a given LTL formula with embeddings of corresponding objects. With this task template, we can create new contextually similar tasks by selecting objects from the environment based on their cosine similarity, balancing selections between highly correlated objects and relevant yet unseen objects.
Figure 4: This figure depicts the results of performing k-means clustering on 768-dimensional embedding vectors for each environment object; results are visualized in a 2D latent space. Embeddings for each object in panel (a) are generated by encoding the shown object name using the Sentence-T5 model. Conversely, embeddings in panel (b) are generated by Sentence-T5 encoding text descriptions of each object generated by text-davinci-003. The number of clusters used in the k-means algorithm was four, based on the number of distinct exploration zones in HomeGrid. Note that embeddings generated from LLM object descriptions improved the separation of emergent cluster boundaries, and desirably increased the distance in latent space between similar yet contextually different objects such as Kitchen Cabinet and Bathroom Cabinet.
Figure 5: Panel (a) shows the normalized discounted reward obtained by the agent on the given task as it learns to solve it simultaneously with TaskExplore-generated auxiliary tasks using a random behavior policy ($\pi^{random}$) and an $\epsilon$-greedy behavior policy ($\pi^*$). Panel (b) shows the task success rate on auxiliary tasks as learning progresses. Learning TaskExplore-generated tasks while using the $\epsilon$-greedy ($\pi^*$) behavior policy on the given task significantly outperforms all other baselines. All results are normalized over 7 different seeded runs.
...and 1 more figures
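
The object-swap step described in the captions for Figures 2 and 3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the 3-dimensional vectors in `EMBEDDINGS` are hand-made stand-ins for the 768-dimensional description embeddings, the object names and the `"F {obj}"` template string are hypothetical, and a plain top-k ranking is used in place of the paper's scheme for balancing highly correlated objects against relevant yet unseen ones, which the captions do not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, hand-made embeddings (assumption: real ones would come from
# encoding LLM-generated object descriptions with a sentence encoder).
EMBEDDINGS = {
    "fridge": [0.9, 0.1, 0.0],
    "kitchen_cabinet": [0.8, 0.3, 0.1],
    "stove": [0.85, 0.05, 0.2],
    "bathroom_cabinet": [0.1, 0.9, 0.2],
    "bed": [0.0, 0.2, 0.9],
}


def rank_replacements(obj, embeddings):
    """Return all other objects, most similar to `obj` first."""
    query = embeddings[obj]
    scored = [
        (cosine_similarity(query, vec), name)
        for name, vec in embeddings.items()
        if name != obj
    ]
    return [name for _, name in sorted(scored, reverse=True)]


def generate_auxiliary_tasks(template, slot_object, embeddings, k=2):
    """Fill the template's object slot with the k most similar objects."""
    candidates = rank_replacements(slot_object, embeddings)[:k]
    return [template.format(obj=name) for name in candidates]


# Usage: swap the object in an "eventually reach {obj}" style template.
aux_tasks = generate_auxiliary_tasks("F {obj}", "fridge", EMBEDDINGS)
```

With these toy vectors the kitchen objects rank above the bathroom cabinet and the bed, which mirrors the intent of selecting replacements that share the original object's context.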
