Robot at the Mirror: Learning to Imitate via Associating Self-supervised Models

Andrej Lucny, Kristina Malinovska, Igor Farkas

TL;DR

This work tackles rapid, on-board imitation learning by combining ready-made self-supervised models with a transformer-style association mechanism that maps perception to action without fine-tuning. It introduces a mirror-based setup in which an image encoder and a pose VAE are paired via attention over key–value associations, so that the robot infers its 3D body pose from vision using $A(q,K,V) = \text{softmax}\left(\frac{qK^T}{d}\right) V$, where $q$ is the query, $K$ and $V$ hold the stored keys and values, and $d$ is a scaling factor. The approach is fully automated and is evaluated by comparing two robots, with a reported normalised mean absolute error (NMAE) of $5.0\%$ under certain settings and as low as $1.14\%$ when testing fixed postures, all without additional training. The work offers a practical, tunable framework for fast skill acquisition in robotics and a reproducible way to study the effects of the association scale and the memory size, with code available at the project repository.

Abstract

We introduce an approach to building a custom model from ready-made self-supervised models by associating them instead of training and fine-tuning. We demonstrate it with an example of a humanoid robot looking into a mirror and learning to detect the 3D pose of its own body from the image it perceives. To build our model, we first obtain features from the visual input and from the postures of the robot's body via models prepared before the robot's operation. Then, we map the corresponding latent spaces through the robot's sample-efficient self-exploration at the mirror. In this way, the robot builds the desired 3D pose detector, whose quality is immediately perfect on the acquired samples rather than improving gradually. The mapping, which associates pairs of feature vectors, is implemented in the same way as the key-value mechanism of the well-known transformer models. Finally, deploying our model for imitation on a simulated robot allows us to study, tune, and systematically evaluate its hyperparameters without involving a human counterpart, advancing our previous research.
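
The key-value association summarised above, $A(q,K,V) = \text{softmax}\left(\frac{qK^T}{d}\right) V$, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `associate`, the random data, the feature dimensions, and the unit-norm keys are all assumptions made for illustration. It shows why a stored sample is recalled almost exactly when the scaling factor $d$ is small, and how a large $d$ blends the stored values instead.

```python
import numpy as np

def associate(q, K, V, d=1.0):
    """Return softmax(q K^T / d) V for a single query vector q.

    K has shape (n, key_dim), V has shape (n, value_dim).
    """
    scores = (K @ q) / d              # similarity of q to every stored key
    scores = scores - scores.max()    # subtract the max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()          # softmax over the n stored pairs
    return weights @ V                # weighted mix of the stored values

# Hypothetical memory: 5 key-value pairs (sizes chosen only for illustration).
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 16))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # assumption: unit-norm keys
V = rng.normal(size=(5, 2))

# Querying with a stored key and a small d recalls its stored value.
recalled = associate(K[2], K, V, d=0.01)
print(np.abs(recalled - V[2]).max())

# A very large d flattens the softmax, so the output approaches the mean of V.
blended = associate(K[2], K, V, d=1e6)
print(np.abs(blended - V.mean(axis=0)).max())
```

With unit-norm keys, a stored key always has its highest similarity with itself, so a small $d$ makes the softmax nearly one-hot. This matches the abstract's remark that quality is immediate on the acquired samples, while the choice of $d$ governs how queries between stored samples are interpolated.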

Paper Structure

This paper contains 9 sections, 6 equations, 3 figures, 1 algorithm.

Figures (3)

  • Figure 1: Left: Visualization of the pose latent space with topographic organization. Right: Development of the key–value pairs over time. The red points represent the collected keys, and the green ones are redundant.
  • Figure 2: An example of learning to imitate at the mirror via association. Top: The testing postures and their points in the latent space. Bottom: Imitated poses.
  • Figure 3: The dependence of NMAE on the number of key–value pairs (left) and on the scaling factor of the association mechanism (right).