Hovering Flight of Soft-Actuated Insect-Scale Micro Aerial Vehicles using Deep Reinforcement Learning

Yi-Hsuan Hsiao, Wei-Tung Chen, Yun-Sheng Chang, Pulkit Agrawal, YuFeng Chen

TL;DR

This work tackles hovering control for soft-actuated insect-scale micro aerial vehicles (IMAVs), which is complicated by system delay and model uncertainty. It introduces a delay-aware initialization that uses modified behavior cloning (BC) with state-action re-matching and domain randomization, followed by proximal policy optimization (PPO) to fine-tune the policy in a delayed simulator. The approach enables zero-shot hovering on two different robots; the longest flight lasts 50 seconds, with a lateral RMSE of $1.34$ cm and an altitude RMSE of $0.05$ cm. The authors report this as the first end-to-end deep RL-based flight on soft-driven IMAVs. The results bridge the Sim2Real gap for end-to-end deep RL on soft IMAVs and highlight the feasibility of compact neural controllers for robust, high-speed flight tasks on insect-scale robots.

Abstract

Soft-actuated insect-scale micro aerial vehicles (IMAVs) pose unique challenges for designing robust and computationally efficient controllers. At the millimeter scale, fast robot dynamics ($\sim$ms), together with system delay, model uncertainty, and external disturbances, significantly affect flight performance. Here, we design a deep reinforcement learning (RL) controller that addresses system delay and uncertainties. To initialize this neural network (NN) controller, we propose a modified behavior cloning (BC) approach with state-action re-matching to account for delay and domain-randomized expert demonstrations to tackle uncertainty. We then apply proximal policy optimization (PPO) to fine-tune the policy during RL, enhancing performance and smoothing commands. In simulations, our modified BC substantially increases the mean reward compared to baseline BC, and RL with PPO improves flight quality and reduces command fluctuations. We deploy this controller on two different insect-scale aerial robots that weigh 720 mg and 850 mg, respectively. The robots demonstrate multiple successful zero-shot hovering flights, with the longest lasting 50 seconds and root-mean-square errors of 1.34 cm in the lateral direction and 0.05 cm in altitude, marking the first end-to-end deep RL-based flight on soft-driven IMAVs.

Paper Structure

This paper contains 15 sections, 14 equations, 7 figures.

Figures (7)

  • Figure 1: An image of a 720-mg eight-wing micro-aerial-robot (left) and an 850-mg four-wing micro-aerial-robot (right), both driven by DEAs. Each robot consists of either a 3D-printed or a carbon fiber airframe that connects four modules. Each module has a DEA, transmissions, wing hinges, and wings. The robots require external systems for sensing, control, and power.
  • Figure 2: Overview of our proposed controller design. First, a set of expert demonstrations is generated from a model-based controller, $\pi_{e_i}$, with randomized domain parameters. Then, we re-match the delayed state with the action to account for system delay, and we use behavior cloning to initialize a neural network controller. Next, in the RL phase, the control policy is fine-tuned with PPO to improve performance and reduce driving command fluctuations. Finally, the controller is integrated into the MATLAB Simulink Real-Time environment to demonstrate robot hovering flight.
  • Figure 3: Workflow of State-Action Re-matching. The expert demonstration is first rolled out in the undelayed simulator; then, we offset the state-action pairs by $d$ and have $(\mathbf{s}_t,\mathbf{a}_{t+d})$ as a pair for supervised learning to clone the delay-compensated controller. The policy then goes through PPO fine-tuning and is deployed to the real-world environment.
  • Figure 4: Simulation results of behavior cloning. (a) Comparison of the baseline method, the method with state-action re-matching, and the method with both state-action re-matching and domain randomization. Colored boxes show the 25th, 50th, and 75th percentiles, and the black bars show the non-outlier minimum and maximum. Dots are outliers that lie more than 1.5 times the interquartile range (IQR) from the top or bottom of the box. (b) Comparison of controller performance as the randomization range increases. (c) Controller performance as a function of training data set size.
  • Figure 5: Simulation results before and after PPO fine-tuning. (a) Training curve of PPO with respect to the chosen reward function. The dark blue line shows the median reward, and the light blue shaded region represents two standard deviations from the median. (b) Performance improvement in simulation after PPO fine-tuning. (c-d) Comparison of command aggressiveness before and after PPO; the fluctuation in command is greatly reduced after PPO fine-tuning.
  • ...and 2 more figures
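
The state-action re-matching described in the Figure 3 caption, which pairs $\mathbf{s}_t$ with $\mathbf{a}_{t+d}$, can be sketched as below. This is a minimal illustration and not the authors' implementation. The function name `rematch_state_action`, the array layout (one row per time step), and the treatment of the delay $d$ as an integer number of control steps are all assumptions made for this example.

```python
import numpy as np


def rematch_state_action(states, actions, d):
    """Pair each state s_t with the expert action a_{t+d}.

    Hypothetical helper: `states` and `actions` hold one row per time
    step of an expert rollout recorded in an undelayed simulator, and
    `d` is the assumed system delay expressed in control steps.
    """
    states = np.asarray(states)
    actions = np.asarray(actions)
    n_steps = states.shape[0]
    if actions.shape[0] != n_steps:
        raise ValueError("states and actions must have the same length")
    if not 0 <= d < n_steps:
        raise ValueError("delay d must satisfy 0 <= d < number of steps")
    # Drop the last d states and the first d actions so that row t of
    # the result holds the pair (s_t, a_{t+d}).
    return states[: n_steps - d], actions[d:]


# Toy rollout: 5 time steps, 2-D state, 1-D action (made-up numbers).
demo_states = np.arange(10, dtype=float).reshape(5, 2)
demo_actions = (10.0 * np.arange(5, dtype=float)).reshape(5, 1)

bc_states, bc_actions = rematch_state_action(demo_states, demo_actions, d=2)
print(bc_states.shape, bc_actions.shape)  # (3, 2) (3, 1)
```

The re-matched arrays can then serve as inputs and targets for ordinary supervised regression. Each rollout loses $d$ pairs at its boundary, so a rollout should be re-matched on its own before pairs from several rollouts are concatenated.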