Exploring Social Media Image Categorization Using Large Models with Different Adaptation Methods: A Case Study on Cultural Nature's Contributions to People
Rohaifa Khaldi, Domingo Alcaraz-Segura, Ignacio Sánchez-Herrera, Javier Martinez-Lopez, Carlos Javier Navarro, Siham Tabik
TL;DR
The study addresses the challenge of classifying social media imagery into Cultural Ecosystem Services (CES) categories by leveraging large models (LLMs, LVMs, LVLMs) with diverse adaptation strategies. It introduces FLIPS, a Flickr-based CES dataset, and evaluates five model configurations across supervised, in-context, and hybrid (unsupervised plus in-context) paradigms. Key findings show that DINOv2 (an LVM) with lightweight fine-tuning achieves the highest accuracy (~97%), while LVLMs such as GPT-4o excel in zero-shot prompting, though a notable gap remains that hybrid approaches can bridge. Ensemble strategies could optimize performance across categories, but at higher computational and environmental costs. The work demonstrates the practical viability of large-model approaches for CES image categorization and suggests future work on finer-grained categories and on incorporating multimodal data for richer understanding and improved performance.
Abstract
Social media images provide valuable insights for modeling, mapping, and understanding human interactions with natural and cultural heritage. However, categorizing these images into semantically meaningful groups remains highly complex because of the vast diversity and heterogeneity of their visual content, which contains open-world human and natural elements. The challenge grows when categories involve abstract concepts and lack consistent visual patterns. Related studies rely on human supervision in the categorization process, and the lack of public benchmark datasets makes comparisons between these works infeasible. On the other hand, continuous advances in large models, including Large Language Models (LLMs), Large Visual Models (LVMs), and Large Visual Language Models (LVLMs), provide a large space of unexplored solutions. In this work, we 1) introduce FLIPS, a dataset of Flickr images that capture the interaction between humans and nature, and 2) evaluate various solutions based on different types and combinations of large models using various adaptation methods. We assess and report their performance in terms of cost, productivity, scalability, and result quality to address the challenges of social media image categorization.
