Leveraging YOLO-World and GPT-4V LMMs for Zero-Shot Person Detection and Action Recognition in Drone Imagery

Christian Limberg, Artur Gonçalves, Bastien Rigault, Helmut Prendinger

TL;DR

The paper investigates zero-shot capabilities of large multimodal models for drone perception, focusing on person detection and action recognition. It compares YOLO-World and GPT-4V on the Okutama-Action aerial dataset, highlighting the strengths and limitations of prompt-based detection and classification in this domain. The key finding is that YOLO-World provides robust person detection while GPT-4V enhances scene understanding and filtering but struggles with precise action labeling under zero-shot settings. The study lays groundwork for integrating LMMs into rescue-drone workflows where collecting task-specific data is impractical and prompts can be quickly adapted for new objectives.

Abstract

In this article, we explore the potential of zero-shot Large Multimodal Models (LMMs) in the domain of drone perception. We focus on person detection and action recognition tasks and evaluate two prominent LMMs, namely YOLO-World and GPT-4V(ision), using a publicly available dataset captured from aerial views. Traditional deep learning approaches rely heavily on large, high-quality training datasets. However, in certain robotic settings, acquiring such datasets can be resource-intensive or impractical within a reasonable timeframe. The flexibility of prompt-based LMMs and their exceptional generalization capabilities have the potential to revolutionize robotics applications in these scenarios. Our findings suggest that YOLO-World demonstrates good detection performance. GPT-4V struggles to classify action classes accurately but delivers promising results in filtering out unwanted region proposals and in providing a general description of the scenery. This research represents an initial step in leveraging LMMs for drone perception and establishes a foundation for future investigations in this area.

Paper Structure (9 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Sample image from the Okutama-Action dataset. The image shows eight persons performing the actions calling, carrying, push/pulling, lying, reading, sitting, and standing.
  • Figure 2: YOLO-World detection on an aerial image. We prompted the model with the single class 'Person' and loaded pre-trained weights. A detection video of the full test dataset is available here: https://www.youtube.com/watch?v=QntgkMKVuVQ.
  • Figure 3: Confusion matrix for action recognition using GPT-4V.
  • Figure 4: GPT-4V classification of YOLO-World region proposals.
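
As a rough illustration of how a confusion matrix such as the one in Figure 3 can be tallied, the following minimal sketch counts true versus predicted action labels. It is not the authors' evaluation code. The label list is taken only from the Figure 1 caption and may not cover the full dataset label set, the function names are hypothetical, and the sample labels are made up purely for demonstration.

```python
# Action labels named in the Figure 1 caption (assumption: the full
# dataset label set may contain further classes).
ACTIONS = ["calling", "carrying", "push/pulling", "lying",
           "reading", "sitting", "standing"]


def confusion_matrix(true_labels, predicted_labels, classes=ACTIONS):
    """Return matrix[i][j]: samples of true class i predicted as class j."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have equal length")
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy, made-up labels for illustration only (not results from the paper).
y_true = ["sitting", "standing", "lying", "reading", "standing"]
y_pred = ["standing", "standing", "lying", "sitting", "standing"]

cm = confusion_matrix(y_true, y_pred)
print(f"accuracy on toy labels: {accuracy(cm):.2f}")
```

Off-diagonal cells show which actions get confused with one another, for example a sitting person labeled as standing in the toy data above.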