CIC: A Framework for Culturally-Aware Image Captioning

Youngsik Yun, Jihie Kim

TL;DR

This work addresses the paucity of culturally descriptive captions in image captioning by introducing Culturally-aware Image Captioning (CIC), a three-stage framework that generates culture-centered questions, extracts cultural elements via VQA, and uses LLM prompts to produce culturally aware captions. Leveraging BLIP2 for captioning and ChatGPT for generation, CIC is evaluated on the GD-VCR dataset through qualitative, human, and automatic metrics, showing improved cultural descriptiveness over strong VLP baselines. The study also introduces a Culture Noise Rate (CNR) metric and ablation analyses to validate the impact of prompts and vocabulary extraction. While the results are promising, the authors acknowledge biases in current models and call for broader cultural coverage and the development of non-reference evaluation methods that better capture cultural elements in images.

Abstract

Image captioning generates descriptive sentences from images using Vision-Language Pre-trained models (VLPs) such as BLIP, and its quality has improved greatly. However, current methods do not generate detailed descriptive captions for the cultural elements depicted in the images, such as the traditional clothing worn by people from Asian cultural groups. In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions describing the cultural visual elements extracted from images representing cultures. Inspired by methods combining visual modality and Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements through Visual Question Answering (VQA) using the generated questions, and (3) generates culturally-aware captions using LLMs with the prompts. Our human evaluation, conducted with 45 participants from 4 different cultural groups who have a high understanding of the corresponding culture, shows that our proposed framework generates more culturally descriptive captions than the image captioning baseline based on VLPs. Resources can be found at https://shane3606.github.io/cic.
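
The three stages listed in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the category list, the question template, the prompt wording, and the use of a base caption as LLM input are all hypothetical, and the VQA model and LLM are passed in as plain callables rather than tied to any specific library.

```python
from typing import Callable, Dict, List

# Hypothetical cultural categories, for illustration only.
CATEGORIES: List[str] = ["clothing", "food", "architecture"]

VQAModel = Callable[[object, str], str]  # (image, question) -> answer
LLM = Callable[[str], str]               # prompt -> generated text


def generate_questions(categories: List[str]) -> Dict[str, str]:
    """Stage 1: one culture-centered question per category (template is assumed)."""
    return {c: f"What kind of {c} is shown in the image?" for c in categories}


def extract_elements(image: object, questions: Dict[str, str], vqa: VQAModel) -> Dict[str, str]:
    """Stage 2: ask each question through VQA and keep the non-empty answers."""
    elements: Dict[str, str] = {}
    for category, question in questions.items():
        answer = vqa(image, question).strip()
        if answer:
            elements[category] = answer
    return elements


def build_prompt(base_caption: str, elements: Dict[str, str]) -> str:
    """Stage 3 input: combine a base caption with the extracted elements (wording is assumed)."""
    listed = "; ".join(f"{c}: {a}" for c, a in elements.items())
    return (
        "Rewrite the caption so that it describes the cultural elements.\n"
        f"Caption: {base_caption}\n"
        f"Cultural elements: {listed}"
    )


def cic_caption(image: object, base_caption: str, vqa: VQAModel, llm: LLM,
                categories: List[str] = CATEGORIES) -> str:
    """Run the three stages in order and return the LLM output."""
    questions = generate_questions(categories)
    elements = extract_elements(image, questions, vqa)
    return llm(build_prompt(base_caption, elements))


# Stand-in models so the sketch runs without any external service.
def fake_vqa(image: object, question: str) -> str:
    return "kimono" if "clothing" in question else ""


def echo_llm(prompt: str) -> str:
    return prompt


print(cic_caption(None, "a woman standing in a garden", fake_vqa, echo_llm))
```

Passing the models in as callables keeps the stage boundaries visible: swapping the stand-ins for a real VQA model and a real LLM client would not change the control flow.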

Paper Structure (19 sections, 2 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Comparison of captions generated by VLPs for images from the same cultural group and from different cultural groups. In the blue box, although both images belong to the Japanese cultural group, traditional Japanese clothing (i.e., kimono) is not described in the caption for the lower image. The red box contains images from different cultural groups, but it is difficult to distinguish the groups from the generated captions.
  • Figure 2: CIC Overview. First, culture questions are generated as described in Section \ref{sec:generate}. Then, cultural visual elements represented in the image are extracted through VQA as described in Section \ref{sec:extracting}. Finally, the LLM generates culturally-aware captions as described in Section \ref{sec:promt}.
  • Figure 3: Captions generated by the baseline models and by our framework (CIC) for images depicting 4 different cultural groups. The red words in each caption represent visual elements of each culture. Compared to the existing baseline models, our framework describes the images in more cultural detail.
  • Figure 4: Captions generated by our framework for modern cultural images from the 4 different cultural groups. The generated culturally-aware captions contain no words that specify a cultural group, making it difficult to distinguish between cultures from the captions alone.