Large Models in Dialogue for Active Perception and Anomaly Detection

Tzoulio Chamiti, Nikolaos Passalis, Anastasios Tefas

TL;DR

The paper tackles active perception and anomaly detection for autonomous drones in open-world environments. It introduces a dialogue-based framework in which a large language model, $f(\mathbf{A}, \mathbf{C})$, issues movement commands and exploratory questions based on VQA outputs, while a Visual Question Answering model, $g(\mathbf{Q}, \mathbf{I})$, provides answers and captions for the current image $\mathbf{I}$. The approach uses a three-phase pipeline (Active Perception, Validation, and Explanation) augmented with GradCAM attention maps to produce an explainable scene description and hazard alerts. Experiments in the AirSim simulator across four environments (a mountain landscape, a snowy road, a public square, and a lake) report improved caption-image alignment and anomaly detection accuracy without fine-tuning, which suggests practical potential for safe, open-world aerial monitoring.
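The exchange between $f$ and $g$ can be pictured as a simple loop. The following is a minimal sketch, not the authors' implementation: it assumes that $\mathbf{A}$ is the list of answers collected so far and $\mathbf{C}$ is the latest caption, and every name in it (`dialogue_loop`, `toy_vqa`, `toy_llm`, the command strings) is a hypothetical stand-in chosen for illustration.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical stand-ins, not the paper's code.
Image = str  # a real system would pass a camera frame instead


def dialogue_loop(
    f: Callable[[List[str], str], Tuple[str, str]],  # f(A, C) -> (command, next question)
    g: Callable[[str, Image], Tuple[str, str]],      # g(Q, I) -> (answer, caption)
    capture: Callable[[], Image],
    execute: Callable[[str], None],
    first_question: str,
    steps: int,
) -> List[Tuple[str, str, str]]:
    """Run `steps` rounds of VQA -> LLM -> drone and return the transcript."""
    answers: List[str] = []
    transcript: List[Tuple[str, str, str]] = []
    question = first_question
    for _ in range(steps):
        image = capture()                             # drone captures the current view
        answer, caption = g(question, image)          # VQA answers and captions it
        answers.append(answer)
        command, next_question = f(answers, caption)  # LLM picks a move and a question
        execute(command)                              # drone moves before the next frame
        transcript.append((question, answer, command))
        question = next_question
    return transcript


# Toy stand-ins so the loop runs without any model or simulator.
def toy_vqa(question: str, image: Image) -> Tuple[str, str]:
    return f"answer to '{question}' in {image}", f"caption of {image}"


def toy_llm(answers: List[str], caption: str) -> Tuple[str, str]:
    moves = ["move forward", "turn left", "ascend"]
    command = moves[(len(answers) - 1) % len(moves)]
    return command, f"What else is visible near step {len(answers)}?"


executed: List[str] = []
frames = iter(f"frame{i}" for i in range(100))

transcript = dialogue_loop(
    toy_llm,
    toy_vqa,
    capture=lambda: next(frames),
    execute=executed.append,
    first_question="What do you see?",
    steps=3,
)
for question, answer, command in transcript:
    print(f"Q: {question} | A: {answer} | command: {command}")
```

Passing the models, the camera, and the actuator in as plain callables keeps the loop independent of any particular LLM, VQA model, or simulator, which matches the no-fine-tuning setting described above.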

Abstract

Autonomous aerial monitoring is an important task aimed at gathering information from areas that may not be easily accessible to humans. At the same time, this task often requires recognizing anomalies that are far away or have not been encountered before. In this paper, we propose a novel framework that leverages the advanced capabilities of Large Language Models (LLMs) to actively collect information and perform anomaly detection in novel scenes. To this end, we propose an LLM-based model dialogue approach, in which two deep learning models engage in a dialogue to actively control a drone and thereby increase perception and anomaly detection accuracy. We conduct our experiments in a high-fidelity simulation environment where an LLM is provided with a predetermined set of natural language movement commands mapped to executable code functions. Additionally, we deploy a multimodal Visual Question Answering (VQA) model that handles visual question answering and captioning. By engaging the two models in conversation, the LLM asks exploratory questions while simultaneously flying the drone to different parts of the scene, providing a novel way to implement active perception. By leveraging the LLM's reasoning ability, we output an improved, detailed description of the scene that goes beyond existing static perception approaches. In addition to information gathering, our approach is used for anomaly detection, and our results demonstrate the proposed method's effectiveness in informing and alerting about potential hazards.
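One way to picture a predetermined set of natural language movement commands mapped to executable functions is a dispatch table. The sketch below is illustrative only: the command vocabulary, the step size, and the `ToyDrone` class are assumptions made here and are neither the paper's command set nor the AirSim API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToyDrone:
    """Hypothetical drone that only tracks its position in metres."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0

    def move(self, dx: float = 0.0, dy: float = 0.0, dz: float = 0.0) -> None:
        self.x += dx
        self.y += dy
        self.z += dz


STEP = 5.0  # assumed step size per command, chosen for illustration

# Assumed command vocabulary; each phrase maps to one executable action.
COMMANDS: Dict[str, Callable[[ToyDrone], None]] = {
    "move forward": lambda d: d.move(dx=STEP),
    "move backward": lambda d: d.move(dx=-STEP),
    "move left": lambda d: d.move(dy=-STEP),
    "move right": lambda d: d.move(dy=STEP),
    "ascend": lambda d: d.move(dz=STEP),
    "descend": lambda d: d.move(dz=-STEP),
}


def execute_command(drone: ToyDrone, text: str) -> bool:
    """Run the action for an LLM-issued command; return False if it is unknown."""
    key = text.strip().lower().rstrip(".")
    action = COMMANDS.get(key)
    if action is None:
        return False  # out-of-vocabulary output is ignored, never executed
    action(drone)
    return True


drone = ToyDrone()
execute_command(drone, "Move forward.")
execute_command(drone, "ascend")
print(drone)
```

Restricting the LLM to a closed vocabulary means that free-form or malformed output cannot trigger arbitrary behaviour; an unknown phrase simply returns `False` and the caller can re-prompt.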

Paper Structure

This paper contains 6 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the proposed model dialogue approach. First, a drone captures an image. This image, along with an appropriate question, is fed to the VQA model. The VQA model then provides a response that is fed to the LLM, which in turn issues a movement command and a new exploratory question.
  • Figure 2: A typical example of the operation of the proposed method. During active perception, the two models engage in a conversation and exchange information. In validation, a preliminary description and caption are chosen together, and the information is validated by revisiting the saved positions. Then, in explanation, the final description and caption are provided together with attention maps.
  • Figure 3: The employed VQA architecture.
  • Figure 4: Four different environments were used for the conducted experiments: a mountain landscape, a snowy road, a public square and a lake.
  • Figure 5: Example anomalies in the four different environments. Note that some anomalies are challenging to detect and require very careful inspection of the input frame.
  • ...and 1 more figure
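
The validation step mentioned in the Figure 2 caption (checking information by revisiting saved positions) could look roughly like the sketch below. The acceptance rule used here, keeping a claim only if it is confirmed at a minimum number of saved positions, is an assumption made for illustration; the actual criterion is not stated in this summary. `validate_claims`, `toy_confirm`, and the `visible` map are hypothetical names.

```python
from typing import Callable, Dict, List, Set, Tuple

Position = Tuple[float, float, float]


def validate_claims(
    claims: List[str],
    saved_positions: List[Position],
    confirm: Callable[[str, Position], bool],
    min_confirmations: int = 1,
) -> List[str]:
    """Keep the claims confirmed at `min_confirmations` or more saved positions.

    `confirm(claim, position)` stands in for flying back to a saved position
    and asking the VQA model whether the claim holds in the new frame.
    """
    kept: List[str] = []
    for claim in claims:
        confirmations = sum(1 for pos in saved_positions if confirm(claim, pos))
        if confirmations >= min_confirmations:
            kept.append(claim)
    return kept


# Toy world: what is actually visible from each saved position.
visible: Dict[Position, Set[str]] = {
    (0.0, 0.0, 10.0): {"road", "snow"},
    (5.0, 0.0, 10.0): {"road", "car"},
}


def toy_confirm(claim: str, pos: Position) -> bool:
    return claim in visible.get(pos, set())


validated = validate_claims(["road", "car", "fire"], list(visible), toy_confirm)
print(validated)
```

Raising `min_confirmations` makes the filter stricter, trading recall of rarely visible anomalies for fewer unsupported statements in the final description.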