Disturbing Image Detection Using LMM-Elicited Emotion Embeddings

Maria Tzelepi; Vasileios Mezaris

TL;DR

This work tackles Disturbing Image Detection (DID) by exploiting knowledge encoded in Large Multimodal Models (LMMs). It prompts MiniGPT-4 to produce both generic semantic descriptions and elicited emotions for each image, encodes these textual outputs with CLIP's text encoder, and combines them with CLIP image embeddings to perform DID. The proposed three-stream fusion (image, semantic descriptions, and elicited emotions) achieves a top accuracy of 96.907% on the DID-Aug dataset, surpassing both the CLIP-image baseline and the previous state of the art. The results indicate that LMM-generated semantic and affective knowledge can significantly improve safety-related vision tasks over image features alone.

Abstract

In this paper we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions. Subsequently, we use CLIP's text encoder to obtain the text embeddings of both the generic semantic descriptions and the LMM-elicited emotions. Finally, we use these text embeddings along with the corresponding CLIP image embeddings to perform the DID task. The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.

Paper Structure (11 sections, 3 figures, 2 tables)

Introduction
Proposed Method
MiniGPT-4
CLIP
Exploiting LMM-generated responses for DID
Experimental Evaluation
Dataset
Evaluation Metrics
Implementation Details
Experimental Results
Conclusions

Figures (3)

Figure 1: Proposed method for Disturbing Image Detection. We first prompt the MiniGPT-4 model to obtain 10 generic semantic descriptions for each image of the dataset. We also prompt the MiniGPT-4 model to obtain 10 elicited emotions for each image of the dataset. Then we extract the CLIP embeddings for both sets of MiniGPT-4-generated responses. Finally, these two text embeddings are concatenated with the corresponding CLIP image embeddings and propagated to the linear layers to perform the DID task, using cross-entropy loss.
Figure 2: MiniGPT-4 responses (semantic descriptions and elicited emotions) for a non-disturbing image.
Figure 3: Example of a test image that was misclassified by the baseline method but correctly classified as disturbing by the proposed method, along with the LMM-generated responses.
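
The fusion step described in the Figure 1 caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: random vectors stand in for the CLIP image and text embeddings, the dimension `EMBED_DIM`, the helper names, and the weight initialization are all hypothetical, and a single linear layer replaces the "linear layers" of the paper. How the 10 responses per stream are reduced to one embedding is not stated here, so the sketch simply assumes one vector per stream.

```python
import math
import random

EMBED_DIM = 8     # hypothetical size; real CLIP embeddings are much larger
NUM_CLASSES = 2   # index 0: non-disturbing, index 1: disturbing


def fuse(image_emb, description_emb, emotion_emb):
    """Concatenate the three streams into a single feature vector."""
    return list(image_emb) + list(description_emb) + list(emotion_emb)


def linear(x, weights, bias):
    """One logit per class: dot(weights[c], x) + bias[c]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


rng = random.Random(0)


def random_vec(n):
    """Placeholder for an embedding produced by CLIP (assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


image_emb = random_vec(EMBED_DIM)        # stands in for the CLIP image embedding
description_emb = random_vec(EMBED_DIM)  # stands in for the description text embedding
emotion_emb = random_vec(EMBED_DIM)      # stands in for the elicited-emotion text embedding

fused = fuse(image_emb, description_emb, emotion_emb)

# Untrained classifier head with small random weights (illustration only).
weights = [[rng.gauss(0.0, 0.1) for _ in range(len(fused))]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = softmax(linear(fused, weights, bias))
loss = cross_entropy(probs, label=1)  # pretend the image is labeled disturbing
print(len(fused), probs, loss)
```

In a real pipeline the classifier weights would be trained by minimizing this loss over the dataset; dropping one of the three arguments to `fuse` gives the corresponding two-stream or single-stream variant.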
