Swimming Under Constraints: A Safe Reinforcement Learning Framework for Quadrupedal Bio-Inspired Propulsion

Xinyu Cui, Fei Han, Hang Xu, Yongcheng Zeng, Luoyang Sun, Ruizhi Zhang, Jian Zhao, Haifeng Zhang, Weikun Li, Hao Chen, Jun Wang, Dixia Fan

Abstract

Bio-inspired aquatic propulsion offers high thrust and maneuverability but is prone to destabilizing forces such as lift fluctuations, which are further amplified by six-degree-of-freedom (6-DoF) fluid coupling. We formulate quadrupedal swimming as a constrained optimization problem that maximizes forward thrust while minimizing destabilizing fluctuations. Our proposed framework, Accelerated Constrained Proximal Policy Optimization with a PID-regulated Lagrange multiplier (ACPPO-PID), enforces constraints with a PID-regulated Lagrange multiplier, accelerates learning via conditional asymmetric clipping, and stabilizes updates through cycle-wise geometric aggregation. Initialized with imitation learning and refined through on-hardware towing-tank experiments, ACPPO-PID produces control policies that transfer effectively to quadrupedal free-swimming trials. Results demonstrate improved thrust efficiency, reduced destabilizing forces, and faster convergence compared with state-of-the-art baselines, underscoring the importance of constraint-aware safe RL for robust and generalizable bio-inspired locomotion in complex fluid environments.
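
As a rough illustration of what a PID-regulated Lagrange multiplier might look like, the sketch below updates a multiplier from the measured episode cost and a cost limit. It follows the commonly used PID Lagrangian formulation and is not taken from the paper: the class name, the gain values, the clamping rules, and the `penalized_advantage` helper are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PIDLagrangeMultiplier:
    """Hypothetical PID controller for the multiplier of a constraint J_c <= cost_limit."""

    kp: float = 0.05        # proportional gain (illustrative value, not from the paper)
    ki: float = 0.01        # integral gain (illustrative value)
    kd: float = 0.02        # derivative gain (illustrative value)
    cost_limit: float = 1.0  # assumed cost threshold
    integral: float = 0.0
    prev_cost: float = 0.0
    value: float = 0.0

    def update(self, episode_cost: float) -> float:
        """Return the new multiplier given the latest measured cost."""
        error = episode_cost - self.cost_limit
        # Clamp the integral at zero so satisfied constraints do not bank negative credit.
        self.integral = max(0.0, self.integral + error)
        # Let the derivative term react only to rising cost.
        derivative = max(0.0, episode_cost - self.prev_cost)
        self.prev_cost = episode_cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        # A Lagrange multiplier for an inequality constraint must stay non-negative.
        self.value = max(0.0, raw)
        return self.value


def penalized_advantage(adv_reward: float, adv_cost: float, lam: float) -> float:
    """Combine reward and cost advantages; dividing by (1 + lam) keeps the scale bounded."""
    return (adv_reward - lam * adv_cost) / (1.0 + lam)
```

In this sketch the multiplier stays at zero while the cost is under the limit and grows while the constraint is violated, so the cost advantage is penalized only when it needs to be.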

Paper Structure (20 sections, 13 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Proposed framework: imitation learning initializes a periodic gait from predefined motions while sensor feedback is Kalman‑filtered; safe RL fine‑tuning accelerates on‑hardware convergence under stability constraints; and the resulting one‑cycle paddle is transferred to diagonal limb pairs with a half‑cycle phase offset to enable smooth and stable free‑swimming.
  • Figure 2: Training curves (reward and average cost) of the baselines over 400 episodes. Each method is trained with three random seeds. The gray line marks the cost limit.
  • Figure 3: Ablations removing the cycle‑level objective, conditional high clipping, or imitation learning quantify each component’s contribution to stability and data efficiency. Each ablation is trained with three random seeds. The gray line marks the cost limit.
  • Figure 4: Forward swimming performance under two representative gaits per algorithm, averaged over three trials. ACPPO-PID achieves the highest scores with lower variance, while CPPO-PID, PPO, and BF show reduced thrust and greater sensitivity to gait differences.
  • Figure 5: Comparison of $F_{x}^{\text{mean}}$, $F_{z}^{\text{mean}}$, and $F_{z}^{\text{var}}$ across algorithms, showing that ACPPO-PID achieves strong thrust with reduced lift fluctuations. The right panel illustrates the maximum displacement achieved in this experiment under parameterized motion, standard PPO training, and our proposed ACPPO-PID.
  • ...and 1 more figures
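
The gait transfer described in the Figure 1 caption (a one-cycle paddle reused on diagonal limb pairs with a half-cycle phase offset) can be sketched as follows. This is only an illustration under stated assumptions: the limb labels `FL`, `FR`, `RL`, `RR`, the choice of which diagonal pair leads, and the representation of a cycle as a list of evenly spaced samples are hypothetical and not specified in the text above.

```python
def diagonal_gait(cycle):
    """Map a one-cycle trajectory onto four limbs with a half-cycle offset.

    `cycle` is a list of samples covering exactly one paddle period.
    Assumed pairing: FL and RR share the original phase, while FR and RL
    run half a cycle later.
    """
    n = len(cycle)
    if n == 0 or n % 2 != 0:
        raise ValueError("cycle needs an even, non-zero number of samples")
    half = n // 2
    # Rotating the list by half its length gives the half-cycle phase offset.
    shifted = cycle[half:] + cycle[:half]
    return {
        "FL": list(cycle),
        "RR": list(cycle),
        "FR": shifted,
        "RL": list(shifted),
    }


gait = diagonal_gait([0, 1, 2, 3])
```

With four samples per cycle, the leading pair follows `[0, 1, 2, 3]` while the other diagonal pair follows `[2, 3, 0, 1]`.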