CONTHER: Human-Like Contextual Robot Learning via Hindsight Experience Replay and Transformers without Expert Demonstrations

Maria Makarova, Qian Liu, Dzmitry Tsetserukou

TL;DR

CONTHER addresses sparse-reward, goal-conditioned robotic manipulation by combining a Transformer-based contextual learner with Hindsight Experience Replay (HER), which populates the replay buffer with artificial successful trajectories so that no expert demonstrations are needed. The method uses a TD3 backbone with a three-module loop (Acting in a Loop, Collecting Experience, Batch Modification) and a context-rich buffer to learn from sequences of states and actions. In the point-reaching task, CONTHER-v.1 outperforms the considered baselines by $38.46\%$ on average and the strongest of them by $28.21\%$; the algorithm is also tested on dynamic trajectory following and obstacle avoidance. For real-world robotics, the practical advantages are reduced data collection and more efficient learning.

Abstract

This paper presents CONTHER, a novel reinforcement learning algorithm designed to train robotic agents efficiently and rapidly for goal-oriented manipulation tasks and obstacle avoidance. The algorithm uses a modified replay buffer inspired by the Hindsight Experience Replay (HER) approach to artificially populate experience with successful trajectories, which addresses the sparse-reward problem and eliminates the need to collect expert demonstrations manually. It employs a Transformer-based architecture to incorporate the context of previous states, allowing the agent to perform a deeper analysis and make decisions in a manner more akin to human learning. The built-in replay buffer, which acts as an "internal demonstrator", has a twofold benefit: it accelerates learning and allows the algorithm to adapt to different tasks. Empirical results show that the algorithm outperforms the other considered methods by 38.46% on average, and the most successful baseline by 28.21%, with higher success rates and faster convergence in the point-reaching task. Because control is performed through the robot's joints, the algorithm can potentially be adapted to a real robot system and extended to an obstacle avoidance task. The algorithm has therefore also been tested on tasks that require following a complex dynamic trajectory and avoiding obstacles. Its design makes it applicable to a wide range of goal-oriented tasks and easy to integrate into real-world robotics applications.
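The hindsight relabeling idea behind the "internal demonstrator" can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `sparse_reward` and `her_relabel`, the distance threshold of 0.05, the 0/-1 reward convention, the `her_ratio` parameter, and the choice of sampling the substitute goal from a later step of the same episode are all assumptions made for illustration.

```python
import numpy as np

def sparse_reward(achieved_goal, goal, threshold=0.05):
    """Hypothetical sparse reward: 0 within `threshold` of the goal, else -1."""
    dist = np.linalg.norm(achieved_goal - goal, axis=-1)
    return np.where(dist <= threshold, 0.0, -1.0)

def her_relabel(next_achieved_goals, goals, rng, her_ratio=0.8):
    """Relabel a fraction of one episode's transitions with goals reached later.

    next_achieved_goals: (T, goal_dim) achieved goal after each step.
    goals:               (T, goal_dim) original episode goal at each step.
    Returns (new_goals, rewards, relabeled_mask).
    """
    T = len(goals)
    new_goals = goals.copy()
    relabeled = rng.random(T) < her_ratio       # which transitions get a new goal
    for t in np.flatnonzero(relabeled):
        future_t = rng.integers(t, T)           # a step at or after t (assumed strategy)
        new_goals[t] = next_achieved_goals[future_t]
    # Rewards are recomputed for every transition against its (possibly new) goal.
    rewards = sparse_reward(next_achieved_goals, new_goals)
    return new_goals, rewards, relabeled

# Toy episode: a short random walk that never reaches the original goal.
rng = np.random.default_rng(0)
T = 20
next_ag = np.cumsum(rng.normal(scale=0.1, size=(T, 3)), axis=0)
goals = np.tile(np.array([5.0, 5.0, 5.0]), (T, 1))

new_goals, rewards, relabeled = her_relabel(next_ag, goals, rng)
print(int((rewards == 0.0).sum()), "of", T, "transitions now count as successes")
```

Under the original goal every transition in this toy episode receives a reward of -1. After relabeling, the transitions whose substitute goal lies within the threshold are rewarded, which is what lets the buffer supply successful experience without demonstrations.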

Paper Structure

This paper contains 9 sections, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Robot environment for testing CONTHER on the Complex Trajectory with Obstacles Task, showing the Goal, Obstacles, Achieved Goal, and experimental trajectories: Path following settings (a), Sinusoid (b), Circle (c), Spiral (d).
  • Figure 2: CONTHER Overview: This figure shows the CONTHER architecture, which consists of three main components: Acting in a Loop module, Collecting Experience module, and Batch Modification module. The Actor in the Acting in a Loop module interacts with the Environment based on the previous context of states (Obs) and goals (G) and passes the collected data to the Collecting Experience module. The Collecting Experience module populates the Main Buffer with vectors of states, goals, achieved goals (AG), and actions, from which batches are periodically sampled. The Batch Modification module adds vectors of future states and achieved goals to a batch of size N, and adds K previous steps representing the context for each of the time steps. HER (Hindsight Experience Replay) is then applied to a portion of the resulting batch, modifying some of the goal values, after which the reward (R) is calculated for all vectors in the batch. The modified batches are then used to train Actor and Critic neural networks.
  • Figure 3: Buffer modification for Sampling during Training. Stored vectors: Observations (Obs), Achieved Goals (AG), Episode Goals (G), Agent Actions (Actions), Next Observations (N-Obs), Next Achieved Goals (N-AG), Reward.
  • Figure 4: Actor and Critic networks architecture for CONTHER-v.0 and CONTHER-v.1. The connection highlighted in red is present only in CONTHER-v.1.
  • Figure 5: Algorithm benchmarking for the Reaching Point Task.
  • ...and 3 more figures
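
The context step described in the Figure 2 caption, where the Batch Modification module adds K previous steps to each sampled time step, can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: the function name `gather_context`, the array layout, and the decision to pad the start of an episode by repeating its first step are assumptions, since the caption does not say how early time steps are handled.

```python
import numpy as np

def gather_context(episode, t_idx, k):
    """Return the k previous steps plus the current step for each sampled index.

    episode: (T, dim) array of per-step vectors from one episode.
    t_idx:   (N,) sampled time-step indices.
    k:       number of previous steps used as context.
    Output shape: (N, k + 1, dim), ordered oldest to newest.
    """
    offsets = np.arange(-k, 1)                      # [-k, ..., -1, 0]
    window_idx = t_idx[:, None] + offsets[None, :]  # (N, k + 1)
    window_idx = np.clip(window_idx, 0, None)       # assumed padding: repeat step 0
    return episode[window_idx]

# Toy data: a 10-step episode with 4-dimensional observations.
obs = np.arange(40, dtype=float).reshape(10, 4)
batch = gather_context(obs, t_idx=np.array([0, 2, 9]), k=3)
print(batch.shape)  # (3, 4, 4)
```

Each row of `batch` is a short sequence that a sequence model such as a Transformer encoder could consume. The same indexing could be applied to goals, achieved goals, and actions so that all stored vectors share one window.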