On Time-Indexing as Inductive Bias in Deep RL for Sequential Manipulation Tasks

M. Nomaan Qureshi; Ben Eisner; David Held

TL;DR

The paper addresses multimodal skill learning in robotic manipulation by proposing a simple time-indexed, multi-head policy in which $k$ heads are activated sequentially for fixed durations $T$, enabling explicit learning of primitive skills such as reaching and grasping. This scheduling-based architecture provides an inductive bias that circumvents the instability of learning multiple sub-skills and switching policies, and is compatible with standard RL algorithms like PPO and SAC. Empirical results on four MetaWorld tasks show improved performance and stability, with notable gains in push-v2, box-close-v2, and bin-picking-v2 where traditional baselines struggle. Overall, the work demonstrates that explicit time-based skill decomposition via neural heads can enhance data efficiency and skill acquisition in sequential manipulation tasks, motivating further exploration of structured policy designs in robotics.

Abstract

Solving complex manipulation tasks often requires a policy to learn a diverse set of skills. This set of skills is often highly multimodal: each skill may have a distinct distribution of actions and states. Standard deep policy-learning algorithms typically model the policy as a deep neural network with a single output head (deterministic or stochastic). This structure requires the network to learn to switch between modes internally, which can lead to lower sample efficiency and poor performance. In this paper, we explore a simple structure that is conducive to the skill learning required by many manipulation tasks. Specifically, we propose a policy architecture that sequentially executes different action heads for fixed durations, enabling the learning of primitive skills such as reaching and grasping. Our empirical evaluation on MetaWorld tasks shows that this simple structure outperforms standard policy-learning methods, highlighting its potential for improved skill acquisition.
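The time-indexed structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the shared trunk, the layer sizes, the random weights, the names `active_head` and `TimeIndexedPolicy`, and the choice to keep the last head active once all `k * T` scheduled steps have elapsed are assumptions made for illustration only.

```python
import numpy as np


def active_head(t: int, k: int, T: int) -> int:
    """Return the index of the head active at step t.

    Head i is assumed to cover steps [i*T, (i+1)*T). Clamping to the
    last head after k*T steps is an assumption, not taken from the paper.
    """
    return min(t // T, k - 1)


class TimeIndexedPolicy:
    """Hypothetical multi-head policy: one shared trunk, k action heads."""

    def __init__(self, obs_dim, act_dim, k, T, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.k, self.T = k, T
        # Shared feature layer (an assumption; heads could also be separate networks).
        self.W_trunk = rng.normal(0.0, 0.1, (obs_dim, hidden))
        # One weight matrix per action head.
        self.W_heads = rng.normal(0.0, 0.1, (k, hidden, act_dim))

    def act(self, obs, t):
        """Compute an action using only the head scheduled for step t."""
        features = np.tanh(obs @ self.W_trunk)
        i = active_head(t, self.k, self.T)
        action = np.tanh(features @ self.W_heads[i])
        return action, i


# Usage: roll the policy forward and record which head acts at each step.
policy = TimeIndexedPolicy(obs_dim=6, act_dim=4, k=4, T=25)
heads_used = [policy.act(np.zeros(6), t)[1] for t in range(100)]
```

Because the head index depends only on the step counter, the switching rule is fixed rather than learned; a standard algorithm such as PPO or SAC would then only need to update the parameters of the trunk and of whichever head produced each action.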

Paper Structure (8 sections, 4 equations, 5 figures)

Introduction
Related Work
Preliminaries
Method
Experiments
Environments
Main results
Conclusion

Figures (5)

Figure 1: A visual representation of our policy. The action heads are executed sequentially, each for a fixed amount of time. This gives the policy a structure conducive to skill learning: it can use these heads to learn primitive skills such as reaching and grasping.
Figure 2: A visual representation of the proposed algorithm depicting the time-indexed policy structure. The policy consists of multiple action heads that are sequentially activated for fixed durations. Each head corresponds to a specific skill or action, enabling the policy to learn specialized skills and integrate them to perform complex tasks.
Figure 3: Primary results. The figure compares the performance of our algorithm against the standard implementation of Proximal Policy Optimization (PPO) on four different tasks. The plot shows the average reward achieved as a function of the number of training steps. Our algorithm consistently outperforms standard PPO across all tasks, demonstrating its effectiveness in discovering improved solutions and yielding higher rewards over time.
Figure 4: The trajectory taken by the policy while attempting the assembly-v2 task. Each color represents the head being executed at a given time step. The policy uses the different action heads to compose skills. For example, the first two action heads (orange and green) are used to compose the reaching skill. The next two heads (red and purple) are then used to compose the grasping skill, and so on.
Figure 5: A comparison of using time to index policy heads (MultiHead) and including time in the observation.
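The two ways of using time contrasted in Figure 5 can be sketched as follows. This is an illustrative sketch only: the function names, the normalization of the step by the horizon, and the clamping to the last head are assumptions, not details taken from the paper.

```python
import numpy as np


def time_in_observation(obs, t, horizon):
    """Variant 1: a single-head policy sees time as an extra input feature.

    Dividing t by the horizon is an assumed normalization.
    """
    return np.concatenate([obs, [t / horizon]])


def head_schedule(horizon, k, T):
    """Variant 2 (MultiHead): time selects which of k heads acts at each step.

    Each head is held for T consecutive steps; the last head stays active
    if the horizon exceeds k*T (an assumption).
    """
    return [min(t // T, k - 1) for t in range(horizon)]


# Usage: the augmented observation gains one dimension, while the
# schedule maps every step of an episode to a head index.
augmented = time_in_observation(np.zeros(3), t=5, horizon=10)
schedule = head_schedule(horizon=6, k=3, T=2)
```

In the first variant the network still has to learn internally how to change behavior as the time feature changes; in the second, the change of behavior is imposed by the schedule itself.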
