Robust Iterative Value Conversion: Deep Reinforcement Learning for Neurochip-driven Edge Robots

Yuki Kadokawa; Tomohito Kodera; Yoshihisa Tsurumine; Shinya Nishimura; Takamitsu Matsubara

TL;DR

The paper addresses the challenge of training DRL policies for edge robots that run as SNNs on neurochips, where converting the policy from an FPNN to an SNN at every update introduces errors that accumulate and disrupt learning. It introduces Robust Iterative Value Conversion (RIVC), which combines quantization-aware learning (matching the SNN bit-width $k$, typically 4 or 8) with a gap-increasing operator that uses parameters $\boldsymbol{\beta}$ and $\boldsymbol{\nu}$ to widen the action gap and resist conversion drift. Key contributions include a novel DRL framework for neurochip-based SNN policies, a conversion-robust policy-update mechanism, and empirical validation showing substantial energy and speed benefits on real-robot tasks (SNN policies on neurochips are ~15× more power-efficient and ~5× faster than on an edge CPU), while prior methods fail to train under conversion errors. These findings highlight the practical potential of DRL-trained SNN policies for energy-constrained robotic applications that use frame-based vision and neurochips such as Akida.

Abstract

A neurochip is a device that reproduces the signal processing mechanisms of brain neurons and computes Spiking Neural Networks (SNNs) at high speed with low power consumption. Neurochips are therefore attracting attention for edge robot applications, which suffer from limited battery capacity. This paper aims to achieve deep reinforcement learning (DRL) that acquires SNN policies suitable for neurochip implementation. Since DRL requires complex function approximation, we focus on conversion from a Floating Point NN (FPNN), one of the most feasible techniques for obtaining SNNs. However, a DRL learning cycle alternates between updating the FPNN policy and collecting samples with the SNN policy, so the policy must be converted to an SNN at every update. The accumulated conversion errors can significantly degrade the performance of the SNN policies. We propose Robust Iterative Value Conversion (RIVC), a DRL framework that both reduces conversion errors and is robust to them. To reduce the errors, the FPNN is optimized with the same number of quantization bits as the SNN, so that quantization does not significantly change the FPNN output. To gain robustness to the remaining errors, the quantized FPNN policy is updated to increase the gap between the probability of selecting the optimal action and those of the other actions, which prevents unexpected replacements of the policy's optimal actions. We verified RIVC's effectiveness on a neurochip-driven robot. The results showed that RIVC consumed 1/15 of the power and computed five times faster than an edge CPU (quad-core ARM Cortex-A72). The previous framework with no countermeasures against conversion errors failed to train the policies. Videos from our experiments are available: https://youtu.be/Q5Z0-BvK1Tc.
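The gap-widening idea described in the abstract can be illustrated with the advantage-learning form of a gap-increasing operator. This is a minimal sketch of that generic operator, not RIVC's exact update, which this page does not give; the function name, the coefficient `alpha`, and the default values are assumptions made for illustration.

```python
from typing import Sequence


def advantage_learning_target(
    reward: float,
    q_s: Sequence[float],      # Q(s, .) for the current state
    q_next: Sequence[float],   # Q(s', .) for the next state
    action: int,               # index of the action actually taken
    gamma: float = 0.99,       # discount factor (assumed value)
    alpha: float = 0.5,        # gap-increasing coefficient (assumed name and value)
    done: bool = False,
) -> float:
    """Bellman target minus a penalty on the taken action's shortfall from max Q(s, .)."""
    bootstrap = 0.0 if done else gamma * max(q_next)
    # The shortfall is zero for the greedy action and positive for every other action,
    # so non-greedy targets are pushed down and the action gap widens.
    shortfall = max(q_s) - q_s[action]
    return reward + bootstrap - alpha * shortfall


if __name__ == "__main__":
    q_s, q_next = [1.0, 0.5], [2.0, 0.0]
    greedy = advantage_learning_target(0.0, q_s, q_next, action=0, gamma=0.5)
    other = advantage_learning_target(0.0, q_s, q_next, action=1, gamma=0.5)
    print(greedy, other)
```

With `alpha = 0` the target reduces to the ordinary Bellman target. A larger `alpha` lowers the targets of non-greedy actions only, which is what makes the greedy action harder to displace by a small perturbation such as a conversion error.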

Paper Structure (38 sections, 19 equations, 9 figures, 3 tables, 1 algorithm)


Introduction
Related Works
Definition of Learning Settings
Reinforcement Learning with Spiking Neural Networks
Reward-modulated Spike Timing Dependent Plasticity (R-STDP)
Training SNN-policy with Backpropagation (SNN-BP)
DRL-policy to SNN-policy Conversion (DRL2SNN)
Comparison with Our Method
Robotic Applications of Spiking Neural Networks and Neurochips
Dynamic Vision Sensor
Low-Dimensional Sensors
Frame-Based Camera
Preliminaries
Reinforcement Learning
Gap-Increasing Operator
...and 23 more sections

Figures (9)

Figure 1: Learning scheme for neurochip-driven robot policies in the proposed framework, which trains the policy of a neurochip-driven robot through real-world interactions. We build an edge-server learning system in which the server updates policies and the neurochip-driven robot collects the sample dataset. Learning flow: 1) the server updates policies; 2) the server sends them to the neurochip; 3) the neurochip-driven robot samples a learning dataset; 4) the neurochip-driven robot sends the samples to the server. This cycle repeats until the policy converges.
Figure 2: RIVC's learning framework, which consists of 1) policy updates, 2) SNN conversion, and 3) dataset sampling. 1) This step is the main part of the proposed framework. The update scheme trains QNN policies whose maximum-value action is not replaced by SNN conversion, by increasing the value gap between that action and the other actions. First, the value of the QNN policy is obtained. Next, the update scheme computes the loss function from a target value that includes the gap-increasing operator, which increases the difference between the estimated maximum action and the other actions. The FPNN policy is then updated with this loss, because the FPNN parameters have more bits than the QNN parameters and give more accurate gradients. The updated FPNN parameters are quantized to QNN parameters, with noise injected into the former to stabilize parameter updates. 2) The trained QNN policy is converted to an SNN policy. 3) The neurochip-driven robot collects samples with the SNN policy. These three steps are repeated until the policy converges.
Figure 3: Difference between ReLU and $\text{ReLU}^q$: $k$ denotes the number of quantization bits, and $\sigma$ denotes the output scaling factor. "Grid Size" indicates the quantization interval.
Figure 4: Simulation settings of the visual-servo task: The task environment comprises a ball (target object), a camera frame (agent), and a task field. The task objective is to track the ball with the camera frame. The agent's action is to move along the horizontal axis. The observation is three consecutive camera-frame images converted from RGB to grayscale. H, W, and R denote height, width, and radius in pixels. RGB denotes RGB color values from 0 to 255.
Figure 5: Comparison of learning methods on a) CartPole and b) Visual Servo. Four-bit quantization is applied to RIVC and DRL2SNN. Each curve plots the mean and variance per sample over five experiments.
...and 4 more figures
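
The quantized activation named in the Figure 3 caption can be sketched as a clipped ReLU whose output is snapped to a uniform grid. The caption defines only $k$, $\sigma$, and the grid size, so the formula below (grid size $\sigma / (2^k - 1)$, clipping at $\sigma$, round-to-nearest) is an assumption for illustration and may differ from the paper's definition; `relu_q` is a hypothetical name.

```python
def relu_q(x: float, k: int = 4, sigma: float = 1.0) -> float:
    """Clipped ReLU rounded to one of 2**k uniformly spaced levels in [0, sigma].

    Assumed form: the grid size (quantization interval) is sigma / (2**k - 1).
    """
    levels = 2 ** k - 1                      # number of grid intervals
    clipped = min(max(x, 0.0), sigma)        # ReLU, then saturate at the scaling factor
    index = round(clipped / sigma * levels)  # nearest grid point, as an integer index
    return index * sigma / levels


if __name__ == "__main__":
    # With k = 2 and sigma = 3.0 the grid size is 1.0.
    for x in (-0.7, 0.2, 1.4, 1.6, 5.0):
        print(x, relu_q(x, k=2, sigma=3.0))
```

Training the FPNN through an activation of this kind is one way to keep its output close to what a $k$-bit network would produce, which is the error-reduction idea stated in the abstract.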
