Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization

Yang Li, Ruichen Zhang, Yinqiu Liu, Guangyuan Liu, Dusit Niyato, Abbas Jamalipour, Xianbin Wang, Dong In Kim

TL;DR

This work tackles the problem of delivering real-time Vision-Language Model (VLM) inference over UAV-enabled LAENets under tight resource constraints. It introduces a hierarchical optimization framework (ARPO-LLaRA) that jointly optimizes image resolution, uplink power, and UAV trajectory, leveraging an offline LLM-designed reward to guide DRL-based trajectory planning without adding real-time latency. ARPO solves the resolution and power subproblem via Branch-and-Bound and KKT, while LLaRA uses LLM-assisted reward design to improve PPO-based trajectory optimization, achieving faster convergence and better policies. Experimental results show substantial latency reductions and robust performance across multi-user, multi-batch scenarios, with resolution-aware trade-offs captured by empirical lookup tables and bandwidth/power sensitivity analyses, highlighting practical viability for onboard inference-as-a-service in LAENets.
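
As a rough illustration of how a resolution lookup table can feed an accuracy constraint, the sketch below picks, for one user, the lowest resolution whose tabulated accuracy meets that user's requirement. It is a simplified per-user greedy pick, not the Branch-and-Bound search that ARPO uses. The table name `LOOKUP`, the function `pick_resolution`, and every number in the table are hypothetical placeholders, not values from the paper. It also assumes that payload size grows with resolution, so the lowest feasible resolution is the cheapest to upload.

```python
# Hypothetical lookup table: resolution (pixels per side) -> (accuracy, payload in bits).
# All numbers are illustrative placeholders, not measurements from the paper.
LOOKUP = {
    336: (0.58, 0.9e6),
    672: (0.64, 3.6e6),
    1024: (0.67, 8.4e6),
}


def pick_resolution(min_accuracy, lookup=LOOKUP):
    """Return the lowest resolution whose tabulated accuracy meets min_accuracy."""
    feasible = [res for res, (acc, _bits) in lookup.items() if acc >= min_accuracy]
    if not feasible:
        raise ValueError("no resolution satisfies the accuracy requirement")
    return min(feasible)
```

A joint search over all users, as in the paper, would additionally account for how each user's payload affects the shared latency objective.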

Abstract

The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. To support these scenarios, unmanned aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs) offer a promising solution for real-time multimodal inference. However, ensuring both inference accuracy and communication efficiency remains a significant challenge due to limited onboard resources and dynamic network conditions. In this paper, we first propose a UAV-enabled LAENet system model that jointly captures UAV mobility, user-UAV communication, and the onboard visual question answering (VQA) pipeline. Based on this model, we formulate a mixed-integer non-convex optimization problem to minimize task latency and power consumption under user-specific accuracy constraints. To solve the problem, we design a hierarchical optimization framework composed of two parts: (i) an Alternating Resolution and Power Optimization (ARPO) algorithm for resource allocation under accuracy constraints, and (ii) a Large Language Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV trajectory optimization. The large language model (LLM) serves as an expert in refining reward design of reinforcement learning in an offline fashion, introducing no additional latency in real-time decision-making. Numerical results demonstrate the efficacy of our proposed framework in improving inference performance and communication efficiency under dynamic LAENet conditions.

Paper Structure

This paper contains 38 sections, 1 theorem, 35 equations, 11 figures, 2 algorithms.

Key Result

Proposition 1

The optimal solution $\mathbf{P}^*=\{P_n^*\}_{n\in\mathcal{N}}$ and $\tau^*$ to problem $\mathbb{P}_2$ are expressed as: [equation omitted], where [definitions omitted]. Specifically, the solution $\tau^*$ can be efficiently obtained using a 1-D bisection search, and the corresponding $\hat{P}_n(\hat{\tau})$ can be computed via Eq. (equa:optimal_p). Thus, the KKT conditions remove the need for a multi-dimensional search, i.e., we evaluate $P_n(\tau)$ in clo…
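
The pattern described in Proposition 1 (a 1-D bisection on $\tau$ with a closed-form per-user power for each candidate $\tau$) can be sketched as follows. Because the paper's actual expressions are not available here, the sketch assumes a stand-in model: each user uploads `bits` over its own band at the Shannon rate, the powers share a total budget `p_total`, and the goal is the smallest common deadline `tau`. The functions `min_power` and `bisect_tau` and all parameter names are hypothetical and do not reproduce problem $\mathbb{P}_2$.

```python
import math


def min_power(tau, bits, bandwidth_hz, gain, noise_w):
    """Closed-form minimum power (W) to upload `bits` within `tau` seconds,
    assuming the rate B*log2(1 + P*gain/noise). Stand-in model, not the paper's."""
    exponent = bits / (bandwidth_hz * tau)
    if exponent > 1000:  # avoid float overflow; such a deadline is infeasible anyway
        return math.inf
    return (2.0 ** exponent - 1.0) * noise_w / gain


def bisect_tau(users, bandwidth_hz, noise_w, p_total, tol=1e-6):
    """users: list of (bits, gain). Returns (tau, powers) for the smallest
    deadline whose closed-form powers fit within the budget p_total."""
    def total_power(tau):
        return sum(min_power(tau, b, bandwidth_hz, g, noise_w) for b, g in users)

    lo, hi = 0.0, 1.0
    while total_power(hi) > p_total:  # grow the bracket until it is feasible
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_power(mid) <= p_total:
            hi = mid  # feasible: try a tighter deadline
        else:
            lo = mid  # infeasible: relax the deadline
    powers = [min_power(hi, b, bandwidth_hz, g, noise_w) for b, g in users]
    return hi, powers


tau, powers = bisect_tau(
    users=[(1e6, 1e-6), (2e6, 2e-6)],
    bandwidth_hz=1e6, noise_w=1e-9, p_total=0.5,
)
```

The bisection works because the required total power decreases monotonically as the deadline grows, so feasibility changes exactly once along the $\tau$ axis.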

Figures (11)

  • Figure 1: An overview of the onboard VLM inference-driven LAENet. The upper part depicts a UAV serving as a flying agent that provides VLM inference services to ground users; the lower part details the onboard VLM pipeline from user queries to answer generation.
  • Figure 2: The proposed hierarchical ARPO-LLaRA optimization framework. At the start of each uplink session, ARPO determines the image resolutions $\mathbf{r}$ and transmit powers $\mathbf{P}$ using the B&B algorithm and the KKT conditions, respectively. Then, LLaRA uses an LLM-assisted DRL method to plan the slot-level UAV trajectory.
  • Figure 3: The workflow of LLaRA. The LLM-augmented reward design employs an LLM expert to generate and iteratively refine the candidate reward functions. The GAE-PPO strategy updates the Actor and Critic networks using the feedback provided by the refined LLM-designed reward.
  • Figure 4: Instance prompts used in the initialization and evolution of our LLM-assisted reward design.
  • Figure 5: Impact of input resolution on TextVQA. Left and middle: our reproduced experiments using LLaVA-HR show that high-resolution inputs (e.g., $1024\mathrm{p}$) preserve fine details such as the tiny "red" car, enabling correct answers that are lost when downsampled to $336\mathrm{p}$. Notably, LLaVA-HR-7B achieves accuracy comparable to much larger models (Qwen3-225B, GPT-4o). Right: published profiling results of LLaVA-HR [luo2024feast], illustrating the accuracy–efficiency–size trade-off across resolutions.
  • ...and 6 more figures
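
The GAE-PPO update mentioned in the Figure 3 caption relies on generalized advantage estimation. A minimal sketch of the standard GAE recursion is given below; the function name `gae_advantages` and the default values of `gamma` and `lam` are assumptions chosen for illustration, and the rewards would in practice come from the LLM-designed reward function rather than the toy values used in the usage note.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: per-step rewards r_0..r_{T-1}.
    values:  critic estimates V_0..V_T (the last entry is the bootstrap value).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V_{t+1} - V_t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

For example, `gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)` reduces to the undiscounted return-to-go minus the value baseline at each step.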

Theorems & Definitions (2)

  • Proposition 1
  • proof