Accelerating Visual Reinforcement Learning with Separate Primitive Policy for Peg-in-Hole Tasks

Zichun Xu, Zhaomin Wang, Yuntao Li, Lei Zhuang, Zhiyuan Zhao, Guocai Yang, Jingdong Zhao

TL;DR

The paper addresses data efficiency in visual reinforcement learning for peg-in-hole assembly by introducing Separate Primitive Policy (S2P), which learns and executes location and insertion actions as distinct primitives. Built on the DrQ-v2 backbone, S2P uses two parallel policies and an ensemble of Q-functions to evaluate prospective actions, enabling sequential locate-then-insert behavior. Empirical results across ten simulated peg-in-hole tasks and real-world experiments show that S2P substantially improves sample efficiency and insertion success, even under force constraints. Ablation studies reveal that while S2P enhances performance for DrQ-v2, it does not consistently improve SAC-based learners, highlighting its dependency on the underlying RL algorithm. The work demonstrates practical viability of vision-based, primitive-policy strategies for fast, robust assembly in robotic manipulation.

Abstract

For peg-in-hole tasks, humans rely on binocular visual perception to locate the peg above the hole surface and then proceed with insertion. This paper draws insights from this behavior to enable agents to learn efficient assembly strategies through visual reinforcement learning. Hence, we propose a Separate Primitive Policy (S2P) to simultaneously learn how to derive location and insertion actions. S2P is compatible with model-free reinforcement learning algorithms. Ten insertion tasks featuring different polygons are developed as benchmarks for evaluations. Simulation experiments show that S2P can boost the sample efficiency and success rate even with force constraints. Real-world experiments are also performed to verify the feasibility of S2P. Ablations are finally given to discuss the generalizability of S2P and some factors that affect its performance.
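
The locate-then-insert scheme described above can be illustrated with a toy sketch. This is not the authors' implementation: the linear heads, the split of actions into a planar (x, y) location move and a vertical (z) insertion move, the ensemble size of two, and the use of the minimum over ensemble members are all assumptions made for illustration, and every class and function name below is hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_layer(rng, n_out, n_in):
    """Small random weights and zero bias for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class SeparatePrimitiveActor:
    """Two parallel heads on one visual feature vector (assumed layout)."""

    def __init__(self, rng, feat_dim, loc_dim=2, ins_dim=1):
        self.loc_head = make_layer(rng, loc_dim, feat_dim)  # location primitive
        self.ins_head = make_layer(rng, ins_dim, feat_dim)  # insertion primitive

    def act(self, feat):
        # tanh keeps each action component inside [-1, 1]
        loc = [math.tanh(v) for v in linear(*self.loc_head, feat)]
        ins = [math.tanh(v) for v in linear(*self.ins_head, feat)]
        return loc, ins


class QEnsemble:
    """Several Q-functions scoring the same (feature, action) pair."""

    def __init__(self, rng, feat_dim, act_dim, n_members=2):
        self.members = [make_layer(rng, 1, feat_dim + act_dim)
                        for _ in range(n_members)]

    def q_values(self, feat, loc, ins):
        x = feat + loc + ins  # concatenate feature and both primitive actions
        return [linear(w, b, x)[0] for w, b in self.members]

    def conservative_q(self, feat, loc, ins):
        # Assumption: aggregate by taking the minimum over members
        return min(self.q_values(feat, loc, ins))


def execute_sequentially(pose, loc, ins):
    """Apply the location move first, then the insertion move."""
    x, y, z = pose
    after_locate = (x + loc[0], y + loc[1], z)
    after_insert = (after_locate[0], after_locate[1], z + ins[0])
    return after_locate, after_insert


rng = random.Random(0)
feat = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for encoder output
actor = SeparatePrimitiveActor(rng, feat_dim=8)
critic = QEnsemble(rng, feat_dim=8, act_dim=3)

loc, ins = actor.act(feat)
q = critic.conservative_q(feat, loc, ins)
after_locate, after_insert = execute_sequentially((0.0, 0.0, 0.1), loc, ins)
```

In this sketch the two heads share one encoded feature, so both primitives are produced in a single forward pass while still being executed one after the other.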

Paper Structure

This paper contains 13 sections, 13 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of the proposed insertion strategy. The encoded visual representations are passed to the actor to infer location and insertion actions, which will be performed sequentially. The output actions are fed to the critic with an ensemble of $Q$-functions to evaluate the corresponding $Q$-values.
  • Figure 2: Network architectures for the actor and critic of S2P-DrQ-v2.
  • Figure 3: Simulation setup and peg-in-hole suites with different shapes, where each peg starts already grasped by the gripper.
  • Figure 4: Training performance of S2P against the plain policy, where the solid line and the shaded area represent the mean and standard deviation across 3 runs.
  • Figure 5: Benchmark results of S2P-DrQ-v2 and DrQ-v2 with force penalty.
  • ...and 4 more figures