Exploring Social Media Image Categorization Using Large Models with Different Adaptation Methods: A Case Study on Cultural Nature's Contributions to People

Rohaifa Khaldi, Domingo Alcaraz-Segura, Ignacio Sánchez-Herrera, Javier Martinez-Lopez, Carlos Javier Navarro, Siham Tabik

TL;DR

The study addresses the challenge of classifying social media imagery into Cultural Ecosystem Services categories by leveraging large models (LLMs, LVMs, LVLMs) with diverse adaptation strategies. It introduces the FLIPS Flickr-based CES dataset and evaluates five model configurations across supervised, in-context, and hybrid unsupervised + in-context paradigms. Key findings show DINOv2 (LVM) with lightweight fine-tuning achieving the highest accuracy (~97%), while LVLMs like GPT-4o excel in zero-shot prompting, with a notable gap that can be bridged by hybrid approaches; ensemble strategies could optimize performance across categories but at higher computational and environmental costs. The work demonstrates practical viability of large-model approaches for CES image categorization and suggests future work on finer-grained categories and incorporating multimodal data for richer understanding and improved performance.

Abstract

Social media images provide valuable insights for modeling, mapping, and understanding human interactions with natural and cultural heritage. However, categorizing these images into semantically meaningful groups remains highly complex due to the vast diversity and heterogeneity of their visual content, as they contain open-world human and natural elements. This challenge becomes greater when categories involve abstract concepts and lack consistent visual patterns. Related studies involve human supervision in the categorization process, and the lack of public benchmark datasets makes comparisons between these works unfeasible. On the other hand, the continuous advances in large models, including Large Language Models (LLMs), Large Visual Models (LVMs), and Large Visual Language Models (LVLMs), provide a large space of unexplored solutions. In this work, we 1) introduce FLIPS, a dataset of Flickr images that capture the interaction between humans and nature, and 2) evaluate various solutions based on different types and combinations of large models using various adaptation methods. We assess and report their performance in terms of cost, productivity, scalability, and result quality to address the challenges of social media image categorization.

Paper Structure

This paper contains 11 sections, 7 figures, and 8 tables.

Figures (7)

  • Figure 1: Six examples of each CES category.
  • Figure 2: High-level illustration of the five proposed approaches, indicated as (1), (2), (3), (4) and (5), to address social media image categorization using large models. See Table 1 for more information about the types of models used and the adaptation methods.
  • Figure 3: Illustration of Approach (4): Supervised learning-based solution involving an LVM adapted using a lightweight fine-tuning method. Frozen implies that the model was applied directly without any training.
  • Figure 4: Illustration of Approach (2): Supervised learning-based solution involving an LVLM combined with an LLM adapted using a lightweight fine-tuning method. Frozen implies that the model was applied directly without any training.
  • Figure 5: Illustration of Approach (3): In-context learning-based solution using an LVLM adapted with a prompting method. Frozen implies that the model was applied directly without any training.
  • ...and 2 more figures