Deep deterministic policy gradient with symmetric data augmentation for lateral attitude tracking control of a fixed-wing aircraft

Yifei Li; Erik-Jan van Kampen

TL;DR

This work tackles offline reinforcement learning for fixed-wing flight control by exploiting dynamical symmetry to enhance sample efficiency. It proposes a symmetric data augmentation (SDA) scheme and a dual-critic framework, including a two-step approximate value iteration (AVI), to better utilize augmented data in Deep Deterministic Policy Gradient (DDPG) learning. The authors also analyze the aircraft’s symmetry to justify augmentation and integrate action-smoothness regularizers (CAPS) to improve robustness. Simulation results show faster policy convergence and improved state-space coverage and attitude-tracking performance for DDPG-SDA and DDPG-SCA (symmetric critic augmentation) compared with standard DDPG. Overall, the approach reduces exploration demands while delivering reliable control performance in unvisited regions, highlighting symmetry as a practical tool for sample-efficient offline RL in aerospace control, with potential for broader physics-informed RL applications.

Abstract

The symmetry of dynamical systems can be exploited for state-transition prediction and to facilitate control policy optimization. This paper leverages system symmetry to develop sample-efficient offline reinforcement learning (RL) approaches. Under the symmetry assumption for a Markov Decision Process (MDP), a symmetric data augmentation method is proposed. The augmented samples are integrated into the dataset of Deep Deterministic Policy Gradient (DDPG) to enhance its coverage rate of the state-action space. Furthermore, sample utilization efficiency is improved by introducing a second critic trained on the augmented samples, resulting in a dual-critic structure. The aircraft's model is verified to be symmetric, and flight control simulations demonstrate accelerated policy convergence when augmented samples are employed.
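The augmentation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the symmetry is a reflection about a reference state and a reference action, and that the reward is unchanged under that reflection. The names `Transition`, `reflect`, `augment`, and `store_with_augmentation` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

Vector = Tuple[float, ...]


class Transition(NamedTuple):
    state: Vector
    action: Vector
    reward: float
    next_state: Vector


def reflect(v: Vector, center: Vector) -> Vector:
    """Mirror v about center, component by component: v' = 2*center - v."""
    return tuple(2.0 * c - x for x, c in zip(v, center))


def augment(tr: Transition, x_ref: Vector, a_ref: Vector) -> Transition:
    """Build the mirrored transition.

    ASSUMPTION: the dynamics are reflection-symmetric about (x_ref, a_ref)
    and the reward takes the same value for a sample and its mirror image.
    """
    return Transition(
        state=reflect(tr.state, x_ref),
        action=reflect(tr.action, a_ref),
        reward=tr.reward,
        next_state=reflect(tr.next_state, x_ref),
    )


def store_with_augmentation(
    buffer: List[Transition], tr: Transition, x_ref: Vector, a_ref: Vector
) -> None:
    """Store the explored sample and its mirrored copy in one buffer."""
    buffer.append(tr)
    buffer.append(augment(tr, x_ref, a_ref))


# Example with made-up numbers: a 2-D state and a 1-D action.
buffer: List[Transition] = []
explored = Transition(
    state=(0.1, -0.2), action=(0.05,), reward=-0.3, next_state=(0.12, -0.18)
)
store_with_augmentation(buffer, explored, x_ref=(0.0, 0.0), a_ref=(0.0,))
```

Each explored transition contributes two entries to the buffer, so state-action coverage grows without extra interaction with the system. A dual-critic variant would route the mirrored copy to a second buffer instead of the shared one.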

Paper Structure (30 sections, 2 theorems, 81 equations, 10 figures, 4 tables)

Introduction
Related Work
Symmetric data augmentation for RL
DDPG/TD3 for flight control
Foundations
Definitions
Policy optimization based on Q function
Exact value iteration
Approximate value iteration
Symmetric Dynamical Model
DDPG with Symmetric Data Augmentation
Symmetric data augmentation
DDPG with symmetric data augmentation
DDPG with Symmetric Critic Augmentation
Drawback of DDPG-SDA
...and 15 more sections

Key Result

Theorem 1

(Symmetry of $x_{t+1}$) Select two samples from the system (nonlinear_system), denoted as $(x_{t},a_{t},x_{t+1})$ and $(x^{\prime}_{t},a^{\prime}_{t},x^{\prime}_{t+1})$, and a reference state $x=x^{*}$. Assuming Equations symxt and symat hold, $x_{t+1}$ and $x^{\prime}_{t+1}$ are symmetric with respect to $x=x^{*}$.
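The property in Theorem 1 can be checked numerically on a toy system. The sketch below is only an illustration under stated assumptions: "symmetric with respect to $x^{*}$" is taken to mean $x + x^{\prime} = 2x^{*}$, the scalar dynamics `step` are a made-up odd function (not the aircraft model), and `step_biased` is a deliberately asymmetric counterexample. All names are hypothetical.

```python
import math

DT = 0.01  # hypothetical integration step


def step(x: float, a: float) -> float:
    """Toy scalar dynamics that are odd in (x, a): f(-x, -a) = -f(x, a)."""
    return x + DT * (-math.sin(x) + a)


def step_biased(x: float, a: float) -> float:
    """Same dynamics plus a constant offset, which breaks the symmetry."""
    return step(x, a) + 0.1


def mirror(v: float, center: float = 0.0) -> float:
    """Reflect v about center (ASSUMED meaning of 'symmetric w.r.t. center')."""
    return 2.0 * center - v


def symmetry_gap(f, x: float, a: float, x_ref: float = 0.0, a_ref: float = 0.0) -> float:
    """Return |x_next + x_next' - 2*x_ref|, which is zero when the two
    successor states are mirror images of each other about x_ref."""
    x_next = f(x, a)
    x_next_mirrored = f(mirror(x, x_ref), mirror(a, a_ref))
    return abs(x_next + x_next_mirrored - 2.0 * x_ref)


gap_symmetric = symmetry_gap(step, x=0.3, a=-0.7)
gap_biased = symmetry_gap(step_biased, x=0.3, a=-0.7)
```

For the odd dynamics the gap vanishes up to rounding, so the mirrored sample is a valid transition of the same system. For the biased dynamics the gap equals twice the offset, which is why the symmetry of the model has to be verified before augmented samples are used.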

Figures (10)

Figure 1: A sketch of the symmetry in the state-action set $\mathcal{S}\times\mathcal{A}$ of a dynamical system. The set contains two subsets: (1) the state-action set of explored samples, denoted as $(\mathcal{S}\times\mathcal{A})_{\text{exp}}$; (2) the state-action set of augmented samples, denoted as $(\mathcal{S}\times\mathcal{A})_{\text{aug}}$.
Figure 2: Brief flowcharts of DDPG-SDA (left) and DDPG-SCA (right). Both methods employ symmetric data augmentation to generate additional samples. DDPG-SDA stores both types of samples in a single replay buffer, whereas DDPG-SCA stores them separately in two replay buffers and adopts a dual-critic structure to enable separate training with explored and augmented samples.
Figure 3: The network architectures of the critic and actor. The critic takes the states, the actions, and the bank-angle tracking error as input and outputs the estimated state-action value, with a $-\text{abs}(\cdot)$ activation function applied at the output layer to ensure the negative definiteness of the estimated state-action value function. The actor outputs the control surface deflections $\delta_{a}$ and $\delta_{r}$. A scaled $\tanh(\cdot)$ activation function constrains the actions to the actuator limits. The target critic and target actor share the architectures of their corresponding primary networks.
Figure 4: The architecture of the RL-based flight control system.
Figure 5: Bank angle references during training and operation. The training reference varies randomly in amplitude within $[0^{\circ},20^{\circ}]$, with each value held for 3 s.
...and 5 more figures
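Two details from the captions above lend themselves to a short sketch: the output-layer activations of Figure 3 and the piecewise-constant training reference of Figure 5. The code below is an illustration only. The function names, the sample time `dt`, the actuator limit passed to `actor_output`, and the uniform sampling of amplitudes are assumptions not stated in the captions.

```python
import math
import random
from typing import List


def critic_output(pre_activation: float) -> float:
    """-abs(.) output activation: the value estimate is never positive."""
    return -abs(pre_activation)


def actor_output(pre_activation: float, limit: float) -> float:
    """Scaled tanh: maps any real input into [-limit, +limit]."""
    return limit * math.tanh(pre_activation)


def training_reference(
    duration_s: float,
    hold_s: float = 3.0,         # each amplitude is held for 3 s (Figure 5)
    max_amp_deg: float = 20.0,   # amplitudes lie in [0, 20] degrees (Figure 5)
    dt: float = 0.01,            # ASSUMED sample time
    seed: int = 0,
) -> List[float]:
    """Piecewise-constant bank-angle reference in degrees.

    ASSUMPTION: amplitudes are drawn uniformly; the caption only says
    they vary randomly.
    """
    rng = random.Random(seed)
    steps_per_hold = round(hold_s / dt)
    n_steps = round(duration_s / dt)
    reference: List[float] = []
    amplitude = 0.0
    for k in range(n_steps):
        if k % steps_per_hold == 0:
            amplitude = rng.uniform(0.0, max_amp_deg)
        reference.append(amplitude)
    return reference


q_value = critic_output(2.5)
deflection = actor_output(100.0, limit=0.3)  # 0.3 is a hypothetical limit
ref = training_reference(duration_s=9.0)
```

Bounding the actor with a scaled `tanh` keeps the commanded deflections inside the actuator range by construction, so no separate clipping step is needed in the control loop.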

Theorems & Definitions (13)

Definition 1
Definition 2
Definition 3
Theorem 1
Proof
Definition 4
Definition 5
Definition 6
Remark 1
Remark 2
...and 3 more
