Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications

Fouad Trad; Ali Chehab

Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications

Fouad Trad, Ali Chehab

TL;DR

This study directly compares prompt-engineered Large Multimodal Models (LMMs) against fine-tuned Vision Transformers (ViTs) for two image-based security tasks: trigger detection (visually evident) and malware classification (visually non-evident). Using MNIST with adversarial triggers and the MaleVis malware visual dataset, the authors show that a fine-tuned ViT achieves perfect performance on trigger detection and substantially outperforms LMMs on malware classification, where LMMs struggle to differentiate among fine-grained classes. Among LMMs, GPT-4o provides the strongest prompting-based results, yet still lags behind ViTs in both tasks. The findings underscore the robustness and reliability of task-specific fine-tuned ViTs for cybersecurity applications, while highlighting the conditional value and current limits of prompt-engineered LMMs in high-stakes visual analysis.

Abstract

The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), which have begun to transform a variety of applications. These sophisticated multimodal models are designed to interpret and analyze complex data by integrating multiple modalities such as text and images, thereby opening new avenues for a range of applications. This paper investigates the applicability and effectiveness of prompt-engineered LMMs that process both images and text, including models such as LLaVA, BakLLaVA, Moondream, Gemini-pro-vision, and GPT-4o, compared to fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct security tasks: 1) a visually evident task of detecting simple triggers, such as small pixel variations in images that could be exploited to access potential backdoors in the models, and 2) a visually non-evident task of malware classification through visual representations. In the visually evident task, some LMMs, such as Gemini-pro-vision and GPT-4o, have demonstrated the potential to achieve good performance with careful prompt engineering, with GPT-4o achieving the highest accuracy and F1-score of 91.9\% and 91\%, respectively. However, the fine-tuned ViT models exhibit perfect performance in this task due to its simplicity. For the visually non-evident task, the results highlight a significant divergence in performance, with ViT models achieving F1-scores of 97.11\% in predicting 25 malware classes and 97.61\% in predicting 5 malware families, whereas LMMs showed suboptimal performance despite iterative prompt improvements. This study not only showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications but also emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.

Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications

TL;DR

Abstract

Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications

Authors

TL;DR

Abstract

Table of Contents

Figures (11)