Imitation Learning Inputting Image Feature to Each Layer of Neural Network
Koki Yamane, Sho Sakaino, Toshiaki Tsuji
TL;DR
The paper tackles a problem in multimodal imitation learning: high-dimensional image data can be overshadowed by signals that correlate more strongly with the output, such as joint angles. It introduces the Each Layer Input method, which feeds image features into every layer of a CNN-LSTM architecture to amplify the influence of weakly correlated data and improve learning at high sampling frequencies (short sampling periods). Experiments on a simple pick-and-place task show substantial increases in success rate for both the CNN+MLP and CNN+SpatialSoftmax variants. Attribution analyses indicate that the image input has a stronger influence on the output and reveal distinctive gradient behavior, particularly in the LSTM memory cells. The approach offers a practical way to use diverse data sources in real-time robotic imitation, and it opens avenues for exploring partial or selective layer inputs in future work.
Abstract
Imitation learning enables robots to learn and replicate human behavior from training data. Recent advances in machine learning enable end-to-end learning approaches that directly process high-dimensional observation data, such as images. However, these approaches face a critical challenge when processing data from multiple modalities: they inadvertently ignore data that have a lower correlation to the desired output, especially when short sampling periods are used. This paper presents a method that addresses this challenge by inputting the weakly correlated data into each neural network layer, which amplifies their influence on the output. The proposed approach effectively incorporates diverse data sources into the learning process. Experiments on a simple pick-and-place operation, with raw images and joint information as input, demonstrate significant improvements in success rates even with short sampling periods.
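The core idea, inputting the weakly correlated data into each layer, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain fully connected layers with tanh activations in place of the paper's LSTM layers, it treats the image feature as an already extracted vector (the CNN is omitted), and all names and dimensions (`build_network`, `forward`, `joint_dim`, `image_dim`, and so on) are hypothetical. It shows only the wiring, in which every layer receives the previous layer's output concatenated with the same image feature.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Create small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def dense(weights, bias, x):
    """Apply one fully connected layer followed by tanh."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def build_network(rng, joint_dim, image_dim, hidden_dim, out_dim, n_layers):
    """Build layers whose input width is always (previous width + image_dim)."""
    layers = []
    in_dim = joint_dim
    for i in range(n_layers):
        n_out = out_dim if i == n_layers - 1 else hidden_dim
        layers.append(init_layer(rng, in_dim + image_dim, n_out))
        in_dim = n_out
    return layers


def forward(layers, joint_state, image_feature):
    """Feed the image feature into every layer, not only the first one."""
    h = list(joint_state)
    for weights, bias in layers:
        # Concatenate the current hidden vector with the image feature.
        h = dense(weights, bias, h + list(image_feature))
    return h


# Hypothetical sizes chosen only for illustration.
rng = random.Random(0)
layers = build_network(rng, joint_dim=3, image_dim=4, hidden_dim=8, out_dim=3, n_layers=3)
joints = [0.1, -0.2, 0.3]
image_feature = [0.5, 0.1, -0.4, 0.2]
action = forward(layers, joints, image_feature)
```

In a conventional design the image feature would be concatenated with the joint state only at the first layer, so its contribution has to survive every subsequent transformation. Here it re-enters at each layer, which is the mechanism the abstract credits with amplifying the influence of the image data.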
