The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning

Longju Bai, Angana Borah, Oana Ignat, Rada Mihalcea

TL;DR

This study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning and introduces MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas.

Abstract

Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at https://github.com/MichiganNLP/MosAIC.

Paper Structure

This paper contains 51 sections, 26 figures, 3 tables.

Figures (26)

  • Figure 1: In a multi-agent setting, three LMM agents, each embodying a curious persona and drawing upon knowledge from a distinct country (India, China, or Romania), participate in a question-and-answer dialogue centered around an image. A fourth agent then summarizes their discussion, creating a culturally enriched image caption.
  • Figure 2: Overview of MosAIC, our proposed framework for Multi-Agent Image Captioning. The framework consists of a multi-agent interaction model, cultural benchmarks and evaluation metrics. The input is an image and the output is a cultural image caption.
  • Figure 3: Multi-Agent Interaction Model. The Moderator presents questions to the Social agents, who engage in three conversation rounds. The Summarizer creates the final image caption by compiling the conversation summaries from the Social agents.
  • Figure 4: Human Annotation Guidelines for Cultural Image Captioning.
  • Figure 5: Our interaction-based model, MosAIC, surpasses non-interaction models and Humans on Completeness and Cultural Info while performing on par with the other models in Alignment. For clarity, the Alignment and Completeness scores are normalized to a [0,1] scale, whereas the Cultural Info score ranges from 0 to the total number of words in a caption.
  • ...and 21 more figures
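
The interaction flow described in the Figure 3 caption (a Moderator posing questions, Social agents conversing for three rounds, and a Summarizer compiling their summaries into one caption) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `echo_model`, `run_interaction`, the prompt strings, and the image reference are all hypothetical placeholders, and a real system would replace `echo_model` with calls to an actual LMM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an LMM call: (persona, prompt) -> text.
Respond = Callable[[str, str], str]


def echo_model(persona: str, prompt: str) -> str:
    """Placeholder model that only labels the speaker; swap in a real LMM."""
    return f"[{persona}] {prompt}"


def run_interaction(
    image_ref: str,
    countries: List[str],
    respond: Respond = echo_model,
    rounds: int = 3,  # the Figure 3 caption mentions three conversation rounds
) -> Tuple[str, Dict[str, List[str]]]:
    """Run a moderator -> social agents -> summarizer loop (illustrative only)."""
    transcripts: Dict[str, List[str]] = {c: [] for c in countries}

    for r in range(1, rounds + 1):
        # The Moderator presents a question to the Social agents.
        question = respond("Moderator", f"Round {r}: pose a question about {image_ref}")
        for country in countries:
            # Each Social agent answers from its own cultural persona.
            reply = respond(f"Social agent ({country})", question)
            transcripts[country].append(reply)

    # Each Social agent condenses its part of the conversation.
    summaries = [
        respond(f"Social agent ({c})", "Summarize: " + " | ".join(transcripts[c]))
        for c in countries
    ]

    # The Summarizer compiles the per-agent summaries into the final caption.
    caption = respond("Summarizer", "Compile: " + " || ".join(summaries))
    return caption, transcripts
```

Calling `run_interaction("image.jpg", ["India", "China", "Romania"])` returns the compiled caption together with the per-agent transcripts, which makes it easy to inspect what each persona contributed before summarization.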