Learn for Variation: Variationally Guided AAV Trajectory Learning in Differentiable Environments

Xiucheng Wang, Zhenye Chen, Nan Cheng

Abstract

Autonomous aerial vehicles (AAVs) empower sixth-generation (6G) Internet-of-Things (IoT) networks through mobility-driven data collection. However, conventional reward-driven reinforcement learning for AAV trajectory planning suffers from severe credit assignment issues and training instability, because sparse scalar rewards fail to capture the long-term and nonlinear effects of sequential movements. To address these challenges, this paper proposes Learn for Variation (L4V), a gradient-informed trajectory learning framework that replaces high-variance scalar reward signals with dense and analytically grounded policy gradients. Specifically, the coupled evolution of AAV kinematics, distance-dependent channel gains, and per-user data-collection progress is first unrolled into an end-to-end differentiable computational graph. Backpropagation through time (BPTT) then serves as a discrete adjoint solver, which propagates exact sensitivities from the cumulative mission objective to every control action and policy parameter. These structured gradients are used to train a deterministic neural policy with temporal smoothness regularization and gradient clipping. Extensive simulations demonstrate that L4V consistently outperforms representative baselines, including a genetic algorithm, DQN, A2C, and DDPG, in mission completion time, average transmission rate, and training cost.
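To make the "unroll, then backpropagate" idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single ground user, a simple distance-dependent rate model, and open-loop velocity controls in place of the neural policy. The constants `H`, `SNR0`, `DT`, and `USER`, as well as all function names, are hypothetical choices made for illustration. The backward loop accumulates a costate over the stored trajectory, which is what BPTT computes for this toy dynamics.

```python
import math

# Hypothetical constants (not taken from the paper).
H = 10.0              # AAV altitude in metres
SNR0 = 1e4            # reference SNR at unit distance
DT = 1.0              # time-step length in seconds
USER = (30.0, 40.0)   # ground-user position

def stage_loss(p):
    """Negative data collected during one step at AAV position p."""
    d2 = (p[0] - USER[0]) ** 2 + (p[1] - USER[1]) ** 2 + H ** 2
    return -DT * math.log2(1.0 + SNR0 / d2)

def stage_grad(p):
    """Analytic gradient of stage_loss with respect to the position p."""
    dx, dy = p[0] - USER[0], p[1] - USER[1]
    d2 = dx * dx + dy * dy + H ** 2
    # Derivative of -DT*log2(1 + SNR0/d2) with respect to d2.
    coef = DT * SNR0 / (math.log(2.0) * d2 * (d2 + SNR0))
    return (coef * 2.0 * dx, coef * 2.0 * dy)

def rollout(p0, actions):
    """Forward pass: unroll p_{t+1} = p_t + DT * a_t and sum the stage losses."""
    traj, p, loss = [], p0, 0.0
    for a in actions:
        p = (p[0] + DT * a[0], p[1] + DT * a[1])
        traj.append(p)
        loss += stage_loss(p)
    return loss, traj

def adjoint_gradients(p0, actions):
    """Backward pass: accumulate the costate and return dLoss/da_t for every t."""
    loss, traj = rollout(p0, actions)
    lam = (0.0, 0.0)                  # costate beyond the final step is zero
    grads = [None] * len(actions)
    for t in reversed(range(len(actions))):
        g = stage_grad(traj[t])       # traj[t] holds the position after action t
        # The state Jacobian of this dynamics is the identity, so the
        # costate simply accumulates the stage gradients backwards in time.
        lam = (lam[0] + g[0], lam[1] + g[1])
        grads[t] = (DT * lam[0], DT * lam[1])
    return loss, grads

if __name__ == "__main__":
    loss, grads = adjoint_gradients((0.0, 0.0), [(1.0, 2.0)] * 5)
    print(loss, grads[0])
```

Every action receives its own gradient from a single backward sweep, in contrast to a single scalar reward per episode. Comparing one entry of `grads` against a central finite difference of `rollout` is a quick way to check the backward recursion.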

Paper Structure (17 sections, 32 equations, 4 figures, 1 algorithm)

Figures (4)

  • Figure 1: Illustration of the proposed L4V framework, an end-to-end differentiable simulation architecture unrolled over time steps. The forward pass (black arrows) generates the AAV trajectory and accumulates the task loss $\mathcal{L}$ based on the system dynamics $F(x, \theta)$. The backward pass (red arrows) executes BPTT, which serves as a discrete adjoint solver that propagates analytical gradients $\nabla_\theta \mathcal{L}$ directly from the global objective to the policy parameters, ensuring physically grounded learning signals.
  • Figure 2: Performance scalability analysis under varying environmental scales. The proposed L4V framework is compared against baselines in terms of (a) communication quality, (b) task completion efficiency, and (c) computational training cost.
  • Figure 3: Performance robustness analysis under varying noise power levels $\sigma$. The proposed L4V framework is compared against baselines in terms of (a) communication quality, (b) task completion efficiency, and (c) computational training cost.
  • Figure 4: Performance scalability analysis with respect to user density. The proposed L4V framework is evaluated against baselines as the number of users increases in terms of (a) communication quality, (b) task completion efficiency, and (c) computational training cost.
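
The backward pass in Figure 1 can be written as a standard discrete adjoint recursion. The notation below is an assumption made for illustration and may differ from the paper's: states evolve as $x_{t+1} = F(x_t, \theta)$ for $t = 0, \dots, T-1$ with $x_0$ fixed, the task loss is $\mathcal{L} = \sum_{t=1}^{T} \ell(x_t)$ for a per-step loss $\ell$, and $\lambda_t$ denotes the total sensitivity of $\mathcal{L}$ to $x_t$.

$$
\lambda_T = \frac{\partial \ell(x_T)}{\partial x_T}, \qquad
\lambda_t = \frac{\partial \ell(x_t)}{\partial x_t} + \left(\frac{\partial F(x_t, \theta)}{\partial x_t}\right)^{\top} \lambda_{t+1}, \quad t = T-1, \dots, 1,
$$

$$
\nabla_\theta \mathcal{L} = \sum_{t=0}^{T-1} \left(\frac{\partial F(x_t, \theta)}{\partial \theta}\right)^{\top} \lambda_{t+1}.
$$

Under these assumptions, one backward sweep over the stored trajectory yields $\nabla_\theta \mathcal{L}$, which is the quantity reverse-mode automatic differentiation computes when applied to the unrolled graph.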