Design and Evaluation of a Multi-Agent Perception System for Autonomous Flying Networks

Diogo Ferreira, Pedro Ribeiro, André Coelho, Rui Campos

TL;DR

MAPS addresses the lack of autonomous perception in Flying Networks (FNs) by fusing visual and audio inputs through multi-modal large language models (MM-LLMs) and a three-agent Brain to produce structured Service Level Specifications (SLSs) for zero-touch network control. On a synthetic emergency dataset, it reaches user detection accuracies above 70% and generates an SLS in under 130 seconds in 90% of cases, with latency dominated by LLM API interactions. The work also contributes a reproducible multimodal synthetic dataset and analyzes practical deployment considerations, including edge computing to reduce latency. Overall, MAPS advances autonomous sensing and decision-making for responsive, infrastructure-light FN operations.

Abstract

Autonomous Flying Networks (FNs) are emerging as a key enabler of on-demand connectivity in dynamic and infrastructure-limited environments. However, current approaches mainly focus on UAV placement, routing, and resource management, neglecting the autonomous perception of users and their service demands - a critical capability for zero-touch network operation. This paper presents the Multi-Agent Perception System (MAPS), a modular and scalable system that leverages multi-modal large language models (MM-LLMs) and agentic Artificial Intelligence (AI) to interpret visual and audio data collected by UAVs and generate Service Level Specifications (SLSs) describing user count, spatial distribution, and traffic demand. MAPS is evaluated using a synthetic multimodal emergency dataset, achieving user detection accuracies above 70% and SLS generation under 130 seconds in 90% of cases. Results demonstrate that combining audio and visual modalities enhances user detection and show that MAPS provides the perception layer required for autonomous, zero-touch FNs.
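To make the SLS concept concrete, the following is a minimal Python sketch of what a structured SLS could look like, assuming it carries the three quantities named in the abstract: user count, spatial distribution, and traffic demand. The class names, field names, units, and JSON layout (`UserGroup`, `ServiceLevelSpecification`, `x_m`, `demand_mbps`, and so on) are hypothetical and chosen for illustration only; they are not the schema used by MAPS.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class UserGroup:
    """One cluster of users. All fields are hypothetical illustrations."""
    x_m: float          # assumed: position east of a reference point, in metres
    y_m: float          # assumed: position north of a reference point, in metres
    user_count: int     # number of users perceived in this cluster
    demand_mbps: float  # assumed: aggregate traffic demand of the cluster


@dataclass
class ServiceLevelSpecification:
    """Hypothetical SLS: user count, spatial distribution, traffic demand."""
    groups: List[UserGroup]

    @property
    def total_users(self) -> int:
        # Sum the per-cluster user counts.
        return sum(g.user_count for g in self.groups)

    @property
    def total_demand_mbps(self) -> float:
        # Sum the per-cluster traffic demands.
        return sum(g.demand_mbps for g in self.groups)

    def to_json(self) -> str:
        # Serialize to a structured document a network controller could parse.
        return json.dumps(
            {
                "total_users": self.total_users,
                "total_demand_mbps": self.total_demand_mbps,
                "groups": [asdict(g) for g in self.groups],
            }
        )


# Usage: two clusters of users at different positions.
sls = ServiceLevelSpecification(
    groups=[
        UserGroup(x_m=0.0, y_m=0.0, user_count=5, demand_mbps=10.0),
        UserGroup(x_m=40.0, y_m=25.0, user_count=2, demand_mbps=4.5),
    ]
)
sls_document = sls.to_json()
```

Keeping the per-cluster positions alongside the totals reflects the abstract's point that an SLS describes spatial distribution as well as aggregate demand, which is the kind of information UAV placement would need.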

Paper Structure

This paper contains 12 sections, 1 equation, 8 figures.

Figures (8)

  • Figure 1: Illustrative example of a Flying Network providing on-demand wireless connectivity to first responders in a disaster management scenario [Fire_situation]. The UAVs act as aerial access points, extending coverage and enabling communication in the absence of fixed infrastructure.
  • Figure 2: Reference FN architecture with MAPS deployed in cloud/edge computing. MAPS is responsible for providing the FN with the data it needs for decision-making. Adapted from [1_ribeiro_supply_2024].
  • Figure 3: Overview of the MAPS architecture. The Perception layer pre-processes multimodal inputs; the Brain integrates three agents (Image, Audio, Fusion) for reasoning; and the Action layer generates structured Service Level Specifications (SLSs) for FN control.
  • Figure 4: Examples of synthetically generated emergency scenarios from the dataset available in [Cardona_MAPS-Dataset_2025], illustrating disaster management scenes. The dataset combines visual and audio modalities for MAPS evaluation.
  • Figure 5: Comparison between the number of users detected by MAPS and the ground truth across all scenarios, considering 10 runs per scenario. The ground truth includes both visible individuals and objects typically associated with user presence (e.g., vehicles).
  • ...and 3 more figures