Achieving Faster and More Accurate Operation of Deep Predictive Learning
Masaki Yoshikawa, Hiroshi Ito, Tetsuya Ogata
TL;DR
Addresses the challenge of achieving both high speed and high precision in real-world robot operation. Proposes HSARNNST, a motion-generation model that combines visual attention, a hierarchical RNN that fuses visual and motion cues, and a Softmax-based decoder that outputs a discrete action distribution, enabling fast, stable inference. Motions are taught slowly to ensure data quality, while inference runs at up to three times the teaching speed. On a real-robot sports stacking task, the model achieves a 94% success rate. The results suggest that combining high-quality data with structured, discrete-action inference can enable robust end-to-end robot learning suitable for real-world deployment.
Abstract
Achieving both high speed and high precision in robot operation is a significant challenge for deploying robots in society. Factory robots excel at predefined tasks, but they struggle with environment-specific actions such as cleaning and cooking. Deep learning research aims to address this by enabling robots to execute behaviors autonomously through end-to-end learning from sensor data. RT-1 and ACT are notable examples that have expanded robots' capabilities. However, problems with model inference speed and hand position accuracy persist. Overcoming them requires high-quality training data and a fast, stable inference mechanism. This paper proposes a motion generation model for high-speed, high-precision tasks, using sports stacking as the example task. By teaching motions slowly and inferring at high speed, the model achieved a 94% success rate in stacking cups with a real robot.
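To make the idea of a Softmax-based decoder over a discrete action distribution concrete, the following is a minimal sketch and not the authors' implementation. It assumes each joint command is discretized into equal-width bins over a normalized range; the bin count, the range, the helper names (`make_bins`, `decode_action`), and the two decoding rules (argmax and expectation) are all illustrative assumptions that do not come from the paper.

```python
import numpy as np

def make_bins(low, high, n_bins):
    """Return the centers of n_bins equal-width bins covering [low, high]."""
    edges = np.linspace(low, high, n_bins + 1)
    return (edges[:-1] + edges[1:]) / 2.0

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def decode_action(logits, centers, mode="expectation"):
    """Turn per-joint logits of shape (n_joints, n_bins) into joint commands.

    mode="argmax" picks the center of the most likely bin;
    mode="expectation" returns the probability-weighted mean of the centers.
    Both rules are assumptions made for this sketch.
    """
    probs = softmax(logits)
    if mode == "argmax":
        return centers[np.argmax(probs, axis=-1)]
    return probs @ centers

# Hypothetical setup: 5 bins over a normalized joint range [-1, 1].
centers = make_bins(-1.0, 1.0, 5)

# Two joints: the first is strongly peaked on the last bin, the second is uniform.
logits = np.array([
    [0.0, 0.0, 0.0, 0.0, 8.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])

probs = softmax(logits)
argmax_cmd = decode_action(logits, centers, mode="argmax")
expected_cmd = decode_action(logits, centers, mode="expectation")
print(argmax_cmd, expected_cmd)
```

In this sketch, argmax decoding snaps each joint to a bin center, while expectation decoding gives a smoother command that can fall between centers. Which rule, if either, matches the paper's decoder is not stated in the text above.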
