
MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding

Surbhi Madan, Shreya Ghosh, Lownish Rai Sookha, M. A. Ganaie, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon

TL;DR

Benchmarking state-of-the-art MIP localization methods on the proposed dataset reveals a significant drop in performance compared to existing datasets, showing that existing MIP localization algorithms must become more robust to 'in-the-wild' situations.

Abstract

Estimating the Most Important Person (MIP) in any social event setup is a challenging problem, mainly due to contextual complexity and the scarcity of labeled data. Moreover, the causality aspects of MIP estimation are quite subjective and diverse. To this end, we aim to address the problem by annotating a large-scale 'in-the-wild' dataset for identifying human perceptions about the 'Most Important Person (MIP)' in an image. The paper provides a thorough description of our proposed Multimodal Large Language Model (MLLM) based data annotation strategy and a detailed data quality analysis. Further, we perform a comprehensive benchmarking of the proposed dataset utilizing state-of-the-art MIP localization methods, which indicates a significant drop in performance compared to existing datasets. The performance drop shows that the existing MIP localization algorithms must be more robust with respect to 'in-the-wild' situations. We believe the proposed dataset will play a vital role in building the next generation of social situation understanding methods. The code and data are available at https://github.com/surbhimadan92/MIP-GAF.

Paper Structure (9 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Overview of data statistics. Left. Overview of train, validation and test set splits. Second Left. Ethnicity distribution over the three splits. Second Right. Age distribution of the detected persons. Right. Distribution of detected faces per image.
  • Figure 2: Data Annotation Pipeline. Overview of our data labelling paradigm. We bring the concept of 'human-in-the-loop' annotation. We initialize the annotation process with MLLM-based annotation followed by a label-refining strategy with human annotators.
  • Figure 3: User study results. Left. Human agreement analysis over images. Right. The dataset-specific level of difficulty in spotting the MIP. The plot shows that the MIP is easy to spot in the MS dataset. Our proposed dataset, MIP-GAF, is more difficult than MS and NCAA.
  • Figure 4: Our proposed MIP-CLIP framework. Stage 1: It learns to classify text inputs and uses positive expressions to locate the MIP on response maps. Stage 2: Trained image and text encoders generate feature maps, and a fusion model localizes MIP using response maps.
  • Figure 5: Qualitative Analysis. We compare the output of different off-the-shelf methods on MS, NCAA, and MIP-GAF datasets. Here, the dotted (green) bounding box indicates the prediction and the solid (red) bounding box indicates the ground truth.
  • ...and 1 more figures