Visual Spatial Attention and Proprioceptive Data-Driven Reinforcement Learning for Robust Peg-in-Hole Task Under Variable Conditions

André Yuji Yasutomi; Hideyuki Ichiwara; Hiroshi Ito; Hiroki Mori; Tetsuya Ogata

TL;DR

This work tackles robust peg-in-hole insertion into concrete holes under variable lighting by fusing vision with proprioceptive data through a spatial attention point network (SAP) and a deep reinforcement learning (DRL) policy (SAP-RL-E). The model is trained offline in a sample-efficient framework that builds a hole map and minimizes the reality gap, enabling end-to-end prediction of attention points and robot actions. Empirical results show that SAP-based models outperform proprioception-only baselines across lighting variations. Offline, SAP-RL-E achieves the highest success rate (SR, ≈97%) with a competitive task completion time (CT, ≈7.65 s); online, it achieves an SR of about 93.9% with a CT of about 8.21 s, indicating that the approach carries over to the physical robot. The approach offers practical benefits for construction automation: it reduces training time, stays robust to shadows and surface irregularities, and could extend to other insertion and assembly tasks with challenging surfaces.

Abstract

Anchor-bolt insertion is a peg-in-hole task performed in the construction field for holes in concrete. Efforts have been made to automate this task, but the variable lighting and hole surface conditions, as well as the requirements for short setup and task execution times, make the automation challenging. In this study, we introduce a vision and proprioceptive data-driven robot control model for this task that is robust to challenging lighting and hole surface conditions. This model consists of a spatial attention point network (SAP) and a deep reinforcement learning (DRL) policy that are trained jointly end-to-end to control the robot. The model is trained in an offline manner, with a sample-efficient framework designed to reduce training time and minimize the reality gap when transferring the model to the physical world. Through evaluations with an industrial robot performing the task in 12 unknown holes, starting from 16 different initial positions, and under three different lighting conditions (two with misleading shadows), we demonstrate that SAP can generate relevant attention points in the image even in challenging lighting conditions. We also show that the proposed model enables task execution with a higher success rate and shorter task completion time than various baselines. Because the proposed model remains effective even under severe lighting, initial-position, and hole conditions, and because the offline training framework is sample-efficient and requires little training time, this approach can be easily applied to construction.

Paper Structure (26 sections, 2 equations, 9 figures, 4 tables)

Introduction
Related work
Peg-in-hole with DRL
Attention for motion generation
Methods
Hole search and peg insertion strategy
Motion generation model
Model inputs and outputs
Model structure
DRL algorithm
Loss function
Offline training framework
Models for comparison
Experimental Setup and Conditions
Experimental setup
...and 11 more sections

Figures (9)

Figure 1: Proposed model structure and training architecture. For the CNNs, values in parentheses are the channel sizes, k is kernel size, and s is stride. For the fully connected layers, values in parentheses are the number of nodes. Values in brackets are the data sizes. $FT$ are the forces and torques, $D_z$ is robot displacement toward the wall, and $a(dir,ss)$ is the discrete robot action which is dependent on the direction $dir$ and step size $ss$.
Figure 2: Hole search method. $P_{z,init}$ is the initial offset position. $D$ is the peg displacement. Hole borders are chamfered due to concrete's brittleness. (a) Approach. (b) Attempt. (c) Separation. (d) Attempt. (e) Insertion.
Figure 3: Details of SAP-RL (top) and AE-RL (bottom).
Figure 4: Illustration of the hole map for offline training. In this study, the mapped area spans ±5 mm from the hole center and the points are spaced 0.25 mm apart.
Figure 5: Experimental setup for inserting anchor bolt. (a) Entire setup. (b) End effector, holes, and anchor bolt.
...and 4 more figures
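
To give a sense of the scale of the hole map described in the Figure 4 caption, the following minimal sketch enumerates a grid that spans ±5 mm around the hole center with 0.25 mm spacing. Only those two numbers come from the caption. The square grid shape, the inclusion of both endpoints, and the function name `build_hole_map_grid` are assumptions made for illustration and may not match the authors' implementation.

```python
def build_hole_map_grid(half_width_mm=5.0, spacing_mm=0.25):
    """Return (x, y) offsets in mm from the hole center.

    Hypothetical helper: assumes a square grid that includes both
    the -half_width_mm and +half_width_mm endpoints on each axis.
    """
    # Number of intervals along one axis (40 for the caption's values).
    steps = round(2 * half_width_mm / spacing_mm)
    # Sample positions along one axis, from -half_width to +half_width.
    axis = [-half_width_mm + i * spacing_mm for i in range(steps + 1)]
    # Cartesian product of the axis with itself, row by row.
    return [(x, y) for y in axis for x in axis]


grid = build_hole_map_grid()
print(len(grid))  # 41 points per axis, so 41 * 41 = 1681 grid points
```

Under these assumptions, each axis has 10 mm / 0.25 mm + 1 = 41 sample positions, so a full square map would hold 1681 points.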
