Robot at the Mirror: Learning to Imitate via Associating Self-supervised Models
Andrej Lucny, Kristina Malinovska, Igor Farkas
TL;DR
This work tackles rapid, on-board imitation learning by leveraging ready-made self-supervised models and a transformer-style association mechanism to map perception to action without fine-tuning. It introduces a mirror-based setup in which an image encoder and a pose VAE are paired via attention over key–value associations, enabling the robot to infer its 3D body pose from vision using $A(q,K,V) = \text{softmax}\left(\frac{qK^T}{d}\right) V$, with query $q$, keys $K$, and values $V$. The approach is fully automated and evaluated by comparing two robots, reporting a normalised mean absolute error (NMAE) of $5.0\%$ under certain settings and as low as $1.14\%$ when testing fixed postures, which indicates that accuracy depends on the parameter settings and is reached without additional training. The work offers a practical, parameter-tunable framework for fast skill acquisition in robotics and a reproducible way to study the effects of association scale and memory size, with code available at the project repository.
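The association formula above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `associate`, the toy orthonormal keys, and the value `d = 0.05` are all hypothetical, and treating $d$ as a freely tunable scale is an assumption based on the formula as written.

```python
import numpy as np


def associate(q, K, V, d):
    """Return softmax(q K^T / d) V for a single query vector q.

    q: (n,) query, K: (m, n) stored keys, V: (m, p) stored values,
    d: positive scale (assumed here to be a tunable hyperparameter).
    """
    logits = (K @ q) / d           # one similarity score per stored key
    logits = logits - logits.max() # shift by the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ V             # weighted blend of the stored values


# Toy memory (hypothetical): three orthonormal keys, three 2-D values.
K = np.eye(4)[:3]
V = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])

# Querying with a stored key and a small scale recalls its paired value.
recalled = associate(K[1], K, V, d=0.05)

# A query halfway between two keys blends their values.
blended = associate((K[0] + K[1]) / 2, K, V, d=0.05)
```

In this toy setting, `recalled` is numerically close to `V[1]` and `blended` is close to the mean of `V[0]` and `V[1]`; a larger `d` would spread the weights more evenly over the whole memory.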
Abstract
We introduce an approach to building a custom model from ready-made self-supervised models by associating them instead of training and fine-tuning. We demonstrate it with an example of a humanoid robot looking at the mirror and learning to detect the 3D pose of its own body from the image it perceives. To build our model, we first obtain features from the visual input and from the postures of the robot's body via models prepared before the robot's operation. Then, we map their corresponding latent spaces through the robot's sample-efficient self-exploration at the mirror. In this way, the robot builds the desired 3D pose detector, whose quality is immediately perfect on the acquired samples rather than improving gradually. The mapping, which associates pairs of feature vectors, is implemented in the same way as the key-value mechanism of the famous transformer models. Finally, deploying our model for imitation on a simulated robot allows us to study, tune, and systematically evaluate its hyperparameters without involving a human counterpart, advancing our previous research.
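One way to read the claim that quality is immediately perfect on the acquired samples is the following sketch. It assumes that the stored keys $k_1, \dots, k_m$ are distinct and unit-normalised, which the abstract does not state. Querying with a stored key $q = k_i$ gives the attention weights

$$w_j = \frac{\exp\left(k_i k_j^T / d\right)}{\sum_{l=1}^{m} \exp\left(k_i k_l^T / d\right)} = \frac{\exp\left(-\left(1 - k_i k_j^T\right)/d\right)}{1 + \sum_{l \neq i} \exp\left(-\left(1 - k_i k_l^T\right)/d\right)}.$$

Because $k_i k_j^T < 1$ for every $j \neq i$ under this assumption, each exponent with $j \neq i$ tends to $-\infty$ as $d \to 0^+$, so $w_j \to \delta_{ij}$ and therefore

$$A(k_i, K, V) = \sum_{j=1}^{m} w_j v_j \;\to\; v_i .$$

Under these assumptions, a sufficiently small scale returns the stored value for each stored key as soon as the pair is added to the memory, with no gradient-based training step.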
