CERNet: Class-Embedding Predictive-Coding RNN for Unified Robot Motion, Recognition, and Confidence Estimation

Hiroki Sawada, Alexandre Pitti, Mathias Quoy

TL;DR

The paper tackles the challenge of enabling robots to simultaneously generate learned motions, infer human or demonstrator intent, and estimate the system's own confidence in real time. It introduces CERNet, a multi-layer predictive-coding RNN with a dynamically updated class-embedding vector that unifies generation and recognition within a single closed-loop model, validated on a Reachy humanoid. Across 26 alphabet trajectories, CERNet achieves a 76% reduction in reproduction error versus a parameter-matched single-layer baseline, demonstrates robustness to external perturbations, and attains real-time recognition with 68% Top-1 and 81% Top-2 accuracy; importantly, internal prediction error serves as an intrinsic confidence signal. This work provides a compact, extensible approach to motor memory and intent-aware human–robot collaboration, with potential extensions to online learning and multimodal sensing.

Abstract

Robots interacting with humans must not only generate learned movements in real time, but also infer the intent behind observed behaviors and estimate the confidence of their own inferences. This paper proposes CERNet, a unified model that achieves all three capabilities within a single hierarchical predictive-coding recurrent neural network (PC-RNN) and leverages a dynamically updated class embedding vector to unify motor generation and recognition. The model operates in two modes: generation and inference. In the generation mode, the class embedding constrains the hidden state dynamics to a class-specific subspace; in the inference mode, it is optimized online to minimize prediction error, enabling real-time recognition. Validated on a humanoid robot across 26 kinesthetically taught alphabet letters, our hierarchical model achieves 76% lower trajectory reproduction error than a parameter-matched single-layer baseline, maintains motion fidelity under external perturbations, and infers the demonstrated trajectory class online with 68% Top-1 and 81% Top-2 accuracy. Furthermore, internal prediction errors naturally reflect the model's confidence in its recognition. This integration of robust generation, real-time recognition, and intrinsic uncertainty estimation within a single PC-RNN framework offers a compact and extensible approach to motor memory in physical robots, with potential applications in intent-sensitive human-robot collaboration.
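As a rough illustration of the inference mode described above, the following Python sketch updates a class-embedding vector online by gradient descent on the squared prediction error. It is not CERNet: the PC-RNN is replaced by a hypothetical toy generator that mixes fixed per-class prototype trajectories (phase-shifted circles) using softmax weights derived from the embedding, the observation is noise-free, and every name (`make_prototypes`, `infer_class`, `lr`) is an assumption introduced here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def make_prototypes(num_classes=3, steps=60, radius=1.0):
    """Toy stand-in for learned motions: one circle per class, shifted in phase."""
    protos = []
    for k in range(num_classes):
        phase = 2.0 * math.pi * k / num_classes
        protos.append([
            (radius * math.cos(2.0 * math.pi * t / steps + phase),
             radius * math.sin(2.0 * math.pi * t / steps + phase))
            for t in range(steps)
        ])
    return protos  # protos[k][t] == (x, y)

def predict(weights, protos, t):
    """Prediction at step t: weighted mix of the class prototypes."""
    x = sum(w * p[t][0] for w, p in zip(weights, protos))
    y = sum(w * p[t][1] for w, p in zip(weights, protos))
    return x, y

def infer_class(observed, protos, lr=1.0):
    """Adapt the class embedding c online to reduce the prediction error.

    Returns the final embedding and the squared prediction error per step.
    """
    c = [0.0] * len(protos)  # start with no preference for any class
    errors = []
    for t, (ox, oy) in enumerate(observed):
        s = softmax(c)
        px, py = predict(s, protos, t)
        ex, ey = px - ox, py - oy            # prediction error at step t
        errors.append(ex * ex + ey * ey)
        # Gradient of 0.5 * ||error||^2 w.r.t. c, through the softmax.
        g = [p[t][0] * ex + p[t][1] * ey for p in protos]
        g_mean = sum(si * gi for si, gi in zip(s, g))
        c = [ci - lr * si * (gi - g_mean) for ci, si, gi in zip(c, s, g)]
    return c, errors

protos = make_prototypes()
embedding, errors = infer_class(protos[1], protos)   # observe class 1
recognized = max(range(len(embedding)), key=lambda k: embedding[k])
print(recognized, errors[0], errors[-1])
```

In this toy setting the largest embedding component marks the recognized class, and the per-step error shrinks as the embedding adapts, which loosely mirrors how the abstract ties prediction error to recognition confidence.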

Paper Structure

This paper contains 22 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Schematic illustration of CERNet. The model integrates top-down predictions and bottom-up errors across hidden states in multiple layers. Blue and red arrows indicate computation for forward propagation and error propagation, respectively. Refer to Eqs. \ref{eq:ClassEmbeddingVector}--\ref{eq:adam_clip} for a detailed explanation of each variable.
  • Figure 2: Reproduction of the letters b, e, k, l, m by the best-performing networks of each model type. Each row corresponds to a different model scale and experimental condition (simulation or real robot), while columns compare single-layer and multi-layer architectures. Dotted lines indicate the original training trajectories, and solid lines represent the motions generated by the models.
  • Figure 3: Perturbation recovery during alphabet reproduction on Reachy using the multi-layer CERNet model. (a) and (b): Layer-wise prediction errors over time while reproducing letter p. The grey area indicates the timesteps where perturbation was injected. (c) End-effector trajectory while reproducing letter p. (d) Predicted trajectories while drawing letter g, showing how the prediction is gradually corrected over time as internal states are updated.
  • Figure 4: Time evolution of the class embedding vector of CERNet in inference mode. The top row shows the prediction and the robot's observation over time, whereas the bottom row shows the intrinsic prediction of the observed motion over time.
  • Figure 5: Final mean squared error (MSE) of the internal prediction error, averaged over time, grouped by recognition outcome. Each boxplot shows the distribution of reconstruction error across trials where the correct class was identified at Top-1 (left), at Top-2 (middle), or was not among the top two predictions (right). Lower errors correspond to higher inference accuracy, indicating that the model's prediction error implicitly reflects its confidence.
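
The grouping used in Figure 5 and the Top-1/Top-2 metrics can be sketched as follows. This is a minimal illustration under stated assumptions: each trial is taken to be a tuple of per-class scores, the true class index, and a time-averaged MSE; Top-2 accuracy is assumed to count trials where the true class ranks first or second, while the "top2" group holds trials where it ranks exactly second. The function names and the trial values are made up for the example and are not results from the paper.

```python
def rank_of_true(scores, true_class):
    """1-based rank of the true class when scores are sorted high to low."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return order.index(true_class) + 1

def top_k_accuracy(trials, k):
    """Fraction of trials whose true class is among the k best-scored classes."""
    hits = sum(1 for scores, true_class, _ in trials
               if rank_of_true(scores, true_class) <= k)
    return hits / len(trials)

def group_errors_by_outcome(trials):
    """Split per-trial MSE values by recognition outcome."""
    groups = {"top1": [], "top2": [], "miss": []}
    for scores, true_class, mse in trials:
        rank = rank_of_true(scores, true_class)
        key = "top1" if rank == 1 else "top2" if rank == 2 else "miss"
        groups[key].append(mse)
    return groups

# Hypothetical trials: (class scores, true class index, time-averaged MSE).
trials = [
    ([0.90, 0.05, 0.05], 0, 0.01),
    ([0.40, 0.50, 0.10], 0, 0.04),
    ([0.20, 0.30, 0.50], 0, 0.09),
    ([0.10, 0.80, 0.10], 1, 0.02),
]
print(top_k_accuracy(trials, 1), top_k_accuracy(trials, 2))
print(group_errors_by_outcome(trials))
```

Comparing the error distributions of the three groups is then a matter of plotting or summarizing each list.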