An Interpretable Neural Control Network with Adaptable Online Learning for Sample Efficient Robot Locomotion Learning

Arthicha Srisuchinnawong; Poramate Manoonpong

TL;DR

This work tackles the sample inefficiency and opacity of reinforcement learning for robot locomotion by introducing SME-AGOL, a framework that combines an interpretable Sequential Motion Executor (SME) with Adaptable Gradient-weighting Online Learning (AGOL). SME decomposes locomotion control into three interpretable layers (Central Pattern Generator Neurons, Basis Neurons, and Output Neurons) across four key poses, while AGOL dynamically prioritizes updates to the most relevant parameters and adapts exploration online. In simulation and on the physical hexapod robot MORF, SME-AGOL achieved substantially higher final rewards and faster learning than CPGRBF baselines, requiring roughly 10 minutes of learning on the physical robot and around 40% fewer samples in simulation. The results support the claim that interpretability can drive both sample efficiency and performance in legged locomotion, with potential for future extensions in transferability and multi-behavior switching.

Abstract

Robot locomotion learning using reinforcement learning suffers from training sample inefficiency and a non-understandable, black-box nature. This work therefore presents SME-AGOL to address both problems. First, the Sequential Motion Executor (SME) is a three-layer interpretable neural network: the first layer produces sequentially propagating hidden states, the second constructs the corresponding triangular bases with minor non-neighbor interference, and the third maps the bases to motor commands. Second, the Adaptable Gradient-weighting Online Learning (AGOL) algorithm prioritizes updates to parameters with high relevance scores, allowing learning to focus on the most relevant ones. Together, these two components yield an analyzable framework in which each sequential hidden state/basis represents a learned key pose/robot configuration. Compared to state-of-the-art methods, SME-AGOL requires 40% fewer samples and achieves a 150% higher final reward/locomotion performance on a simulated hexapod robot, while taking merely 10 minutes of learning time from scratch on a physical hexapod robot. Taken together, this work not only proposes SME-AGOL for sample-efficient and understandable locomotion learning but also emphasizes the potential of exploiting interpretability to improve sample efficiency and learning performance.
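The three-layer structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar phase variable, the `phase_step` increment, the function names, and the exact triangular shape are all assumptions made for illustration, and the bases here have zero (rather than minor) non-neighbor overlap. The sketch only shows how sequential states, triangular bases, and an output mapping can combine so that a motor command interpolates between key poses.

```python
N_STATES = 4  # assumption: four key poses, as mentioned in the TL;DR


def step_phase(phase, phase_step):
    """Advance a cyclic phase in [0, 1); phase_step is a hypothetical timing parameter."""
    return (phase + phase_step) % 1.0


def cpg_states(phase, n=N_STATES):
    """Layer 1 (assumed form): one-hot state that propagates sequentially with the phase."""
    active = int(phase * n) % n
    return [1.0 if i == active else 0.0 for i in range(n)]


def triangular_bases(phase, n=N_STATES):
    """Layer 2 (assumed form): one triangular bump per state, overlapping only its neighbors."""
    bases = []
    for i in range(n):
        center = (i + 0.5) / n
        d = abs(phase - center)
        d = min(d, 1.0 - d)  # cyclic distance on the unit phase circle
        bases.append(max(0.0, 1.0 - d * n))
    return bases


def motor_command(bases, key_poses):
    """Layer 3 (assumed form): linear map from bases to one motor command."""
    return sum(b * k for b, k in zip(bases, key_poses))


if __name__ == "__main__":
    key_poses = [0.2, -0.1, 0.4, 0.0]  # hypothetical joint targets, one per key pose
    phase = 0.0
    for _ in range(8):
        bases = triangular_bases(phase)
        print(cpg_states(phase), round(motor_command(bases, key_poses), 3))
        phase = step_phase(phase, 0.05)
```

Under these assumptions, each output weight is simply the joint target at one key pose: when a basis peaks, the command equals that pose, and between peaks it is a linear blend of two neighboring poses. That one-to-one reading of weights as poses is the kind of interpretability the abstract refers to.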

Paper Structure (15 sections, 16 equations, 17 figures, 2 tables, 3 algorithms)

Introduction
Interpretable Sequential Motion Executor-Adaptable Gradient-weighting Online Learning (SME-AGOL)
Sequential Motion Executor (SME) Neural Control
Central Pattern Generator Neurons (Cs)
Basis Neurons (Bs)
Output Neurons (RFs–LHs)
Adaptable Gradient-weighting Online Learning (AGOL)
Experiments and Results
Simulation Experiment
Physical Robot Experiment
Discussion and Conclusion
Example of Leg Coordination Patterns
Example of Different Leg Pattern
CPGRBF neural control
The Activities of Fully Connected Neural Network

Figures (17)

Figure 1: (a) An overview of the Sequential Motion Executor-Adaptable Gradient-weighting Online Learning (SME-AGOL) architecture, presented along with the corresponding signals obtained from the three layers, and the experimental hexapod robot platform MORF used in the study. The three layers include a central pattern generator layer ($c_i$, red) providing the discrete (non-smooth) robot states, a basis layer ($b_i$, green) providing the smooth version of the robot states or the movement bases, and an output layer ({$M_i$} = {$RF_{1-3}$, $RM_{1-3}$, $RH_{1-3}$, $LF_{1-3}$, $LM_{1-3}$, $LH_{1-3}$}) providing the motor commands (e.g., $RF_{1-3}$, blue), which can be interpreted as the interpolated key poses. (b) The physical and simulated versions of MORF including the motor positions of a leg, their rotational axes, the RealSense tracking camera, the world frame, the robot frame, and the transformation between the world and robot frames.
Figure 2: (a) Central pattern generator neurons/internal states ($c_i$) and bases ($b_i$) obtained from SME where $w_\tau$ = (left) 0.05 and (right) 0.10. (b) Intersection between two non-neighbor (left) radial bases and (right) equivalent triangular bases. (c) Bases ($b_i$) and outputs ($RF_i$) obtained from SME where $w^{b_i}_{c_n}$ and $w^{b_i}_{c_m}$ are set as (left) 0.3$w_\tau$ and 0.09$w_\tau$, (middle) 0.5$w_\tau$ and 0.25$w_\tau$, and (right) 0.8$w_\tau$ and 0.64$w_\tau$.
Figure 3: Online locomotion learning process. First, the robot interacts with the environment using the explored action $\tilde{a}_t$ generated by the SME neural control. Second, the AGOL learning rule updates the SME control parameters using the samples from the trajectory $\tau$, consisting of the reward $r_t$ obtained from the interaction with the environment, the network state (including the parameters $\theta$, explored parameters $\tilde{\theta}_t$, and exploration standard deviation $\sigma_\theta$), and the relevance explanation $|\text{Rel}_{\tilde{\theta}}|$ computed from the network state.
Figure 4: Mann-Whitney U test p-values for all comparison pairs in terms of (a) the final episodic reward and (b) the number of episodes taken to reach a reward of 0.2, where each cell intensity represents the significance level. The white cells indicate the pairs with p-value $\geq$ 0.05 (insignificant comparison), the light green cells indicate the pairs with low p-value, i.e., p-value $<$ 0.05 (significant comparison), and the dark green cells indicate the pairs with very low p-value, i.e., p-value $\ll$ 0.05 (highly significant comparison).
Figure 5: Average episodic rewards, i.e., learning curves and the corresponding min-max range (shade) obtained from five different learning algorithms validated with the CPGRBF neural control architecture under two implementations: (a) batch learning and (b) online learning.
...and 12 more figures
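
The online learning loop in the Figure 3 caption (explored parameters $\tilde{\theta}_t$, reward $r_t$, exploration standard deviation $\sigma_\theta$, and relevance $|\text{Rel}_{\tilde{\theta}}|$) can be sketched as a generic parameter-exploration update with per-parameter relevance weights. This is not the paper's AGOL rule: the normalization of relevance, the reward baseline, the learning rate `lr`, and both function names are assumptions for illustration, and the online adaptation of exploration is not covered.

```python
import random


def explore(theta, sigma, rng):
    """Sample explored parameters by adding Gaussian noise (assumed exploration scheme)."""
    return [t + rng.gauss(0.0, sigma) for t in theta]


def relevance_weighted_update(theta, theta_explored, reward, baseline,
                              relevance, sigma, lr=0.1):
    """One hypothetical update step: parameters with higher |relevance| move more."""
    total = sum(abs(r) for r in relevance) or 1.0  # avoid division by zero
    weights = [abs(r) / total for r in relevance]
    advantage = reward - baseline
    return [
        t + lr * w * advantage * (te - t) / (sigma ** 2)
        for t, te, w in zip(theta, theta_explored, weights)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.0, 0.0, 0.0]
    relevance = [3.0, 1.0, 0.0]  # hypothetical relevance scores
    explored = explore(theta, 0.1, rng)
    reward = 1.0                 # hypothetical episodic reward
    theta = relevance_weighted_update(theta, explored, reward, 0.0, relevance, 0.1)
    print(theta)
```

In this sketch a parameter with zero relevance is left untouched, so each update is concentrated on the parameters judged most responsible for the observed behavior, which is the intuition behind gradient weighting.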
