Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models

Yasiru Ranasinghe; Vibashan VS; James Uplinger; Celso De Melo; Vishal M. Patel

TL;DR

This work tackles zero-shot automatic target recognition in novel environments by fusing open-world detectors with large vision-language models. It presents a cascaded two-stage pipeline where a binary detector localizes candidate objects and an LVLM reevaluates labels using open-set, closed-set, or Chain-of-Thought prompting. Findings show API LVLMs offer superior recognition, CoT prompting enhances labeling under challenging conditions, and the approach enables false-positive reduction and robust ATR across RGB, grayscale, and thermal modalities. The study also analyzes factors like distance, modality, and prompting strategies, outlining practical pathways for safer and more reliable ATR in dynamic domains.
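The three prompting strategies named above can be illustrated with a small prompt builder. This is only a sketch: the function name `build_prompt`, the mode strings, and the prompt wording are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def build_prompt(mode: str, labels: Optional[Sequence[str]] = None) -> str:
    """Return an illustrative text prompt for relabeling one detector crop.

    The wording of every prompt here is an assumption, not the paper's.
    """
    if mode == "open_set":
        # Open-set: the model may answer with any label.
        return "What object is shown in this image crop? Answer with a short label."
    if mode == "closed_set":
        # Closed-set: the model must choose from a fixed vocabulary.
        if not labels:
            raise ValueError("closed_set prompting needs a non-empty label list")
        options = ", ".join(labels)
        return (
            f"Which one of the following is shown in this image crop: {options}? "
            "Answer with exactly one label from the list."
        )
    if mode == "cot":
        # Chain-of-Thought: ask for visible attributes before the final label.
        return (
            "Describe the visible attributes of the object in this image crop "
            "step by step, then give your answer on a final line as 'Label: <name>'."
        )
    raise ValueError(f"unknown prompting mode: {mode}")
```

A closed-set call such as `build_prompt("closed_set", ["tank", "truck"])` restricts the answer space, while the `"cot"` mode asks the model to state the attributes it relied on before committing to a label.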

Abstract

Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. In extreme use cases, such as military applications, these factors are often challenged due to the presence of unknown terrains, environmental conditions, and novel object categories. Current object detectors, including open-world detectors, lack the ability to confidently recognize novel objects or operate in unknown environments, as they have not been exposed to these new conditions. However, Large Vision-Language Models (LVLMs) exhibit emergent properties that enable them to recognize objects in varying conditions in a zero-shot manner. Despite this, LVLMs struggle to localize objects effectively within a scene. To address these limitations, we propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains. In this study, we compare the performance of various LVLMs for recognizing military vehicles, which are often underrepresented in training datasets. Additionally, we examine the impact of factors such as distance range, modality, and prompting methods on the recognition performance, providing insights into the development of more reliable ATR systems for novel conditions and classes.
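The cascaded design described in the abstract, in which a detector localizes objects and an LVLM assigns the labels, could be wired together roughly as follows. This is a minimal sketch under stated assumptions: `recognize`, `Recognition`, the `reject_label` mechanism for dropping false positives, and the toy stand-ins for the detector and the LVLM are all hypothetical, and images are plain nested lists so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end coordinates exclusive
Image = Sequence[Sequence[int]]   # toy image: rows of pixel values


@dataclass
class Recognition:
    box: Box
    label: str


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the region covered by `box` out of the image."""
    x0, y0, x1, y1 = box
    return [list(row[x0:x1]) for row in image[y0:y1]]


def recognize(
    image: Image,
    detect: Callable[[Image], List[Box]],
    relabel: Callable[[List[List[int]]], str],
    reject_label: str = "background",  # assumption: this label marks a false positive
) -> List[Recognition]:
    """Detection phase followed by reevaluation phase."""
    results = []
    for box in detect(image):              # detection phase: class-agnostic boxes
        label = relabel(crop(image, box))  # reevaluation phase: label each crop
        if label != reject_label:          # drop crops judged to be false positives
            results.append(Recognition(box, label))
    return results


# Toy stand-ins for a real detector and a real LVLM call (both hypothetical).
def toy_detect(image: Image) -> List[Box]:
    return [(0, 0, 2, 2), (2, 0, 4, 2)]


def toy_relabel(patch: List[List[int]]) -> str:
    return "vehicle" if any(v > 0 for row in patch for v in row) else "background"


toy_image = [[0, 0, 9, 9], [0, 0, 9, 9]]
toy_results = recognize(toy_image, toy_detect, toy_relabel)
```

In this toy run the first box covers only zero-valued pixels, so it is relabeled as background and dropped, which leaves a single `vehicle` recognition for the second box. Passing the detector and the LVLM as callables keeps the two phases independent, so either component can be swapped without changing the other.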

Paper Structure (10 sections, 6 figures, 4 tables)

This paper contains 10 sections, 6 figures, 4 tables.

Introduction
Related Work
Proposed Pipeline
Detection phase
Reevaluation phase
Experimental Settings
Datasets
Vision-language models
Results
Conclusion

Figures (6)

Figure 1: Comparison between existing architectures for zero-shot, text-prompted automatic target recognition (ATR). Standard open-world ATR requires a human in the loop, because the novel objects to be detected and recognized must be provided to the detector. Even then, state-of-the-art open-world ATR systems fail to recognize novel object classes that deviate strongly from the training classes. In LLM-based ATR, the detector is used only to localize the objects present in the image. Each localized object is then sent to a large vision-language model for recognition, which eliminates the need for user intervention.
Figure 2: The proposed pipeline for ATR using LVLMs. First, in the 'Detection phase,' the image is passed through the object detector for binary detection, which localizes the objects in the scene and produces crops. Then, in the 'Reevaluation phase,' these crops are sent to the LVLM, which assigns each object its label.
Figure 3: Sample images from the datasets, showing the differences between the conditions tested for automatic target recognition. Top left: near object from the DSIAC dataset with clear visibility. Top right: far object from the DSIAC dataset with difficult visibility. Bottom left: thermal image from the ADAS dataset, illustrating the deviation from natural images. Bottom right: sample synthetic image from the AIS dataset, used for OOD samples.
Figure 4: Misrecognition by open-world detectors for novel object categories (first column) and the localization performance of binary detection (second column) compared to using a keyword vocabulary.
Figure 5: The pipeline can be used to remove false positives (left image) produced by the detector. The Chain-of-thought recognition on the thermal image illustrates the attributes used to label the object.
...and 1 more figure
