Transformer-XL for Long Sequence Tasks in Robotic Learning from Demonstration
Gao Tianci
TL;DR
Transformer-XL is applied to robotic Learning from Demonstration (LfD) to handle long sequences of multimodal sensor data (RGB-D, LiDAR, tactile). The method extracts features with ResNet-based networks and fuses them, encodes the resulting sequences with Transformer-XL using position encoding and sparse attention, and then plans actions with a Q-learning/PPO-based module. Training combines behavior cloning and PPO with data augmentation. Experiments on RoboMimic show task success rates and execution times that are competitive with or better than those of LSTM and CNN baselines. The work presents a scalable framework for long-horizon robotic tasks with rich sensor inputs, pointing toward more robust perception and decision-making in real-world LfD.
Abstract
This paper applies Transformer-XL to long-sequence tasks in robotic learning from demonstration (LfD). The proposed framework integrates multi-modal sensor inputs, including RGB-D images, LiDAR, and tactile sensors, into a single feature vector. By leveraging Transformer-XL's attention mechanism and position encoding, the approach handles the long-term dependencies inherent in multi-modal sensory data. An empirical evaluation shows significant improvements in task success rate, accuracy, and computational efficiency over conventional methods such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs). The findings indicate that the Transformer-XL-based framework enhances the robot's perception and decision-making abilities and provides a robust foundation for future work in robotic learning from demonstration.
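As a rough illustration of the pipeline summarized above, the following NumPy sketch fuses per-timestep modality features and encodes a long sequence segment by segment, letting each segment attend to a cached memory of the previous one, which is the segment-level recurrence that distinguishes Transformer-XL. It is a minimal sketch under stated assumptions, not the paper's implementation: fusion by concatenation, a single attention head and layer, the feature sizes, and all function names (`fuse_features`, `attend_with_memory`, `encode_long_sequence`) are hypothetical, and relative position encoding and sparse attention are omitted for brevity.

```python
import numpy as np


def fuse_features(rgbd, lidar, tactile):
    """Concatenate per-timestep modality features (assumed fusion scheme)."""
    return np.concatenate([rgbd, lidar, tactile], axis=-1)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_with_memory(segment, memory, w_q, w_k, w_v):
    """Single-head causal attention over the context [memory; segment]."""
    context = np.concatenate([memory, segment], axis=0)  # (M + T, d)
    q = segment @ w_q                                    # (T, d_k)
    k = context @ w_k                                    # (M + T, d_k)
    v = context @ w_v                                    # (M + T, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (T, M + T)

    t, m = segment.shape[0], memory.shape[0]
    # Step i may see all of memory and segment steps <= i, never the future.
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    mask = np.concatenate([np.zeros((t, m), dtype=bool), future], axis=1)
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v


def encode_long_sequence(features, seg_len, mem_len, w_q, w_k, w_v):
    """Process a long sequence in segments, carrying a bounded memory."""
    memory = features[:0]  # empty memory before the first segment
    outputs = []
    for start in range(0, features.shape[0], seg_len):
        segment = features[start:start + seg_len]
        outputs.append(attend_with_memory(segment, memory, w_q, w_k, w_v))
        context = np.concatenate([memory, segment], axis=0)
        # Keep only the most recent mem_len steps for the next segment.
        memory = context[-mem_len:] if mem_len > 0 else context[:0]
    return np.concatenate(outputs, axis=0)


# Toy usage with made-up feature sizes: 10 timesteps, 6 + 4 + 2 = 12 dims.
rng = np.random.default_rng(0)
features = fuse_features(
    rng.normal(size=(10, 6)),  # stand-in for RGB-D features
    rng.normal(size=(10, 4)),  # stand-in for LiDAR features
    rng.normal(size=(10, 2)),  # stand-in for tactile features
)
w_q = rng.normal(size=(12, 8))
w_k = rng.normal(size=(12, 8))
w_v = rng.normal(size=(12, 5))
encoded = encode_long_sequence(features, seg_len=4, mem_len=3,
                               w_q=w_q, w_k=w_k, w_v=w_v)
```

Because the memory is truncated to `mem_len` steps, the cost per segment stays bounded regardless of total sequence length, while information from earlier segments can still reach later ones. In a trained model the cached states would also be detached from the gradient, which this forward-only sketch does not need to model.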
