Achieving Faster and More Accurate Operation of Deep Predictive Learning

Masaki Yoshikawa, Hiroshi Ito, Tetsuya Ogata

TL;DR

The paper addresses the challenge of achieving both high speed and high precision in real-world robot operation. It proposes HSARNNST, a motion-generation model that combines visual attention, a hierarchical RNN that fuses visual and motion cues, and a softmax-based decoder that outputs a discrete action distribution, enabling fast, stable inference. Motions are taught slowly to ensure data quality, and inference then runs at up to three times the teaching speed. On a real-robot sports stacking task, the model achieved a success rate of 0.94 (94%). The results suggest that combining high-quality training data with structured, discrete-action inference can enable robust end-to-end robot learning suitable for deployment in real-world settings.
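To make the idea of a discrete action distribution concrete, the following is a minimal sketch of one way a continuous joint command could be encoded as a softmax distribution over bins and decoded back. It is not the paper's Softmax Transformation, whose details are not given here. The bin range, bin count, `temperature`, and the function names `bin_centers`, `encode`, and `decode` are all illustrative assumptions.

```python
import math

# Hypothetical helpers: none of these names or parameter values come from the paper.

def bin_centers(low, high, n_bins):
    """Evenly spaced bin centers covering [low, high]."""
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def encode(value, centers, temperature=0.01):
    """Turn a scalar into a soft distribution peaked at the nearest bins.

    Logits are negative squared distances to each bin center, so the
    softmax output is a discretized Gaussian-like bump (an assumption).
    """
    logits = [-((value - c) ** 2) / temperature for c in centers]
    return softmax(logits)

def decode(probs, centers):
    """Recover a scalar as the expected value of the distribution."""
    return sum(p * c for p, c in zip(probs, centers))

# Example: a normalized joint angle in [-1, 1] represented with 21 bins.
centers = bin_centers(-1.0, 1.0, 21)
probs = encode(0.33, centers)
recovered = decode(probs, centers)
print(f"recovered value: {recovered:.3f}")
```

In a setup like this, a network would be trained to predict `probs` (for example with a cross-entropy loss) rather than the raw scalar, and the decoded expectation would be sent to the robot. Values close to the ends of the range decode with some bias toward the interior, because the distribution is truncated at the outermost bins.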

Abstract

Achieving both high speed and precision in robot operations is a significant challenge for social implementation. While factory robots excel at predefined tasks, they struggle with environment-specific actions like cleaning and cooking. Deep learning research aims to address this by enabling robots to autonomously execute behaviors through end-to-end learning with sensor data. RT-1 and ACT are notable examples that have expanded robots' capabilities. However, issues with model inference speed and hand position accuracy persist. High-quality training data and fast, stable inference mechanisms are essential to overcome these challenges. This paper proposes a motion generation model for high-speed, high-precision tasks, exemplified by the sports stacking task. By teaching motions slowly and inferring at high speeds, the model achieved a 94% success rate in stacking cups with a real robot.

Paper Structure (3 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Overview of model and data processing. (a) Network structure of HSARNNST. (b) Data processing of Softmax Transformation.
  • Figure 2: Experimental environment and results. (a) Robot performs the task of stacking cups in five positions, including an untaught position. (b)(c)(d) Hand trajectories of the compared models during motion generation at position C.