A Hierarchical Agentic Framework for Autonomous Drone-Based Visual Inspection

Ethan Herron, Xian Yeow Lee, Gregory Sin, Teresa Gonzalez Diaz, Ahmed Farahat, Chetan Gupta

TL;DR

The paper tackles autonomous drone-based visual inspection in industrial settings, addressing safety, scalability, and adaptability gaps in manual and traditional drone systems. It introduces a hierarchical agentic framework with a head agent for high-level planning and one worker agent per drone, coupled with a novel ReActEval reasoning loop (Reason-Act-Evaluate) that enables self-correcting, task-driven control. In simulated experiments with multiple models and three levels of task complexity, it systematically compares ReActEval against ReAct and Act, showing that method effectiveness depends on model capability and task difficulty, and that higher-performing models unlock the benefits of structured reasoning. The study provides design insights for adaptive, multi-drone inspection systems and highlights tradeoffs between reasoning depth, model capacity, and task complexity, with implications for real-world deployment and future hybrid architectures.

Abstract

Autonomous inspection systems are essential for ensuring the performance and longevity of industrial assets. Recently, agentic frameworks have demonstrated significant potential for automating inspection workflows but have been limited to digital tasks. Their application to physical assets in real-world environments, however, remains underexplored. In this work, our contributions are two-fold: first, we propose a hierarchical agentic framework for autonomous drone control, and second, we introduce a reasoning methodology for individual function executions, which we refer to as ReActEval. Our framework focuses on visual inspection tasks in indoor industrial settings, such as interpreting industrial readouts or inspecting equipment. It employs a multi-agent system comprising a head agent and multiple worker agents, each worker controlling a single drone. The head agent performs high-level planning and evaluates outcomes, while worker agents implement ReActEval to reason over and execute low-level actions. Operating entirely in natural language, ReActEval follows a plan-reason-act-evaluate cycle, enabling drones to handle tasks ranging from simple navigation (e.g., flying forward 10 meters and landing) to complex high-level tasks (e.g., locating and reading a pressure gauge). The evaluation phase serves as a feedback and/or replanning stage, ensuring actions align with user objectives while preventing undesirable outcomes. We evaluate the framework in a simulated environment with two worker agents, assessing performance qualitatively and quantitatively based on task completion across varying complexity levels and workflow efficiency. By leveraging natural language processing for agent communication, our approach offers a novel, flexible, and user-accessible alternative to traditional drone-based solutions, enabling autonomous problem-solving for industrial inspection without extensive user intervention.
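
The reason-act-evaluate cycle described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Drone` class, the `react_eval` loop, and the scripted `stub_reason` and `stub_evaluate` functions are hypothetical stand-ins for the drone API and the LLM calls, chosen only to show how evaluator feedback can steer the next reasoning step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Drone:
    """Toy stand-in for a drone API (hypothetical, not the paper's interface)."""
    position_m: float = 0.0
    airborne: bool = False

    def takeoff(self, _: float = 0.0) -> str:
        self.airborne = True
        return "took off"

    def forward(self, meters: float) -> str:
        if not self.airborne:
            return "error: not airborne"
        self.position_m += meters
        return f"moved forward {meters} m"

    def land(self, _: float = 0.0) -> str:
        self.airborne = False
        return "landed"


Action = Tuple[str, float]       # (function name, argument)
Step = Tuple[Action, str, str]   # (action, observation, evaluator feedback)


def react_eval(
    drone: Drone,
    reason: Callable[[List[Step]], Optional[Action]],
    evaluate: Callable[[Action, str], str],
    max_steps: int = 10,
) -> List[Step]:
    """Run reason -> act -> evaluate until the reasoner stops or the budget ends."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = reason(history)                   # Reason: pick the next call
        if action is None:                         # reasoner judges the task done
            break
        name, arg = action
        observation = getattr(drone, name)(arg)    # Act: call the drone function
        feedback = evaluate(action, observation)   # Evaluate: accept or flag
        history.append((action, observation, feedback))
    return history


# Scripted stand-ins for the LLM (assumptions made for illustration only).
GOAL: List[Action] = [("forward", 10.0), ("land", 0.0)]


def stub_reason(history: List[Step]) -> Optional[Action]:
    if history and history[-1][2] != "ok":
        return ("takeoff", 0.0)                    # replan after negative feedback
    done = [action for action, _, feedback in history if feedback == "ok"]
    remaining = [action for action in GOAL if action not in done]
    return remaining[0] if remaining else None


def stub_evaluate(action: Action, observation: str) -> str:
    return "retry: precondition failed" if observation.startswith("error") else "ok"


drone = Drone()
for step in react_eval(drone, stub_reason, stub_evaluate):
    print(step)
```

In this toy run the scripted reasoner first tries to fly forward while still on the ground, the evaluator flags the error, and the next reasoning step inserts a takeoff before retrying. A real worker agent would replace both stubs with natural-language model calls.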

Paper Structure

This paper contains 24 sections, 6 figures, 3 tables, 4 algorithms.

Figures (6)

  • Figure 1: An overview of the hierarchical agentic framework. Users define a task for the agentic framework to complete with the drone. The Head Agent creates a plan to accomplish the user-defined task. The Worker Agent uses the ReActEval method to execute actions that accomplish the high-level plan defined by the Head Agent. The Worker Agent communicates with the drone and other tools (VLMs, secondary task-specific models, etc.) by directly calling API functions.
  • Figure 2: Distribution of failure modes across the different methods. The analysis of ReActEval, ReAct, and Act methods revealed three primary failure modes: incorrect function calls, early stopping, and head agent failure. The proposed ReActEval method reduces the number of incorrect function calls. Early stopping is consistent across all three methods and hints at a larger problem with the underlying LLMs. Head Agent failures, i.e., incorrect drone indexing or poor planning, are minimal and consistent across each method.
  • Figure 3: Examples of the three primary failure modes observed across all methods. The first example demonstrates incorrect/repeated function calls where the drone executes actions out of sequence and performs unnecessary operations. The second shows early stopping where the drone reaches the target location but fails to complete the full task requirements. The third illustrates head agent failure where incorrect drone availability assessment prevents any task execution.
  • Figure 4: Demonstration of the Hierarchical Agentic Framework deployed with two drones in a mock industrial setting. There are two drones operating in the upper left image. The rightmost drone is oriented towards a picture of a pressure gauge attached to a red pipe assembly. The image in the top right is a cropped version of the image taken by the drone. The 'Image Analysis' is the output description of the captured image from the Agentic Framework's Vision Language Model.
  • Figure 5: Example transcript from the ReActEval method with GPT-4.1 nano on a medium-level task. In this example, ReActEval with a small model fails to accomplish the task. The failure originates from an incorrect target coordinate, (4,0,0) instead of (0,4,0), which propagates through the remaining steps.
  • ...and 1 more figure
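
The head/worker split in Figure 1, and the "incorrect drone indexing" failure noted in Figure 2, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `WorkerAgent`, `run_head_agent`, and the round-robin `stub_planner` are hypothetical names, and the planner is a scripted stand-in for an LLM-generated plan rather than the paper's method.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Plan = List[Tuple[int, str]]   # (drone index, natural-language subtask)


@dataclass
class WorkerAgent:
    """Hypothetical worker that controls exactly one drone."""
    drone_id: int
    log: List[str] = field(default_factory=list)

    def execute(self, subtask: str) -> str:
        # A real worker would run its own reasoning loop here.
        self.log.append(subtask)
        return f"drone {self.drone_id}: completed '{subtask}'"


def run_head_agent(
    task: str,
    workers: Dict[int, WorkerAgent],
    planner: Callable[[str, List[int]], Plan],
) -> List[str]:
    """Plan at a high level, validate drone indices, then dispatch subtasks."""
    plan = planner(task, sorted(workers))
    unknown = [idx for idx, _ in plan if idx not in workers]
    if unknown:
        # Guard against a plan that references drones that do not exist.
        raise ValueError(f"plan references unknown drones: {unknown}")
    return [workers[idx].execute(subtask) for idx, subtask in plan]


def stub_planner(task: str, drone_ids: List[int]) -> Plan:
    """Scripted stand-in for the head agent's LLM: split on ' and ', round-robin."""
    subtasks = [part.strip() for part in task.split(" and ")]
    return [(drone_ids[i % len(drone_ids)], sub) for i, sub in enumerate(subtasks)]


workers = {0: WorkerAgent(0), 1: WorkerAgent(1)}
for line in run_head_agent("read the pressure gauge and inspect the pipe",
                           workers, stub_planner):
    print(line)
```

Validating indices before dispatch is one simple design choice for catching a bad plan early; the source does not state how the framework itself handles this case.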