MOSAIC: Multimodal Multistakeholder-aware Visual Art Recommendation

Bereket A. Yilma, Luis A. Leiva

TL;DR

Visual art recommendation is inherently multistakeholder: it must balance user preferences against broader ecosystem objectives. MOSAIC uses multimodal representations from CLIP and BLIP to jointly optimize for user relevance, popularity, and representative coverage, expressed via policies and MIP formulations. Offline and user studies show that popularity has a strong effect on perceived quality, while representativeness has a more limited one. BLIP-based backbones generally outperform CLIP in user perception, suggesting a reduced modality gap and better semantic alignment. The work suggests that MOSAIC enables learning and discovery beyond traditional personalization, with practical implications for museums, artists, and collectors, and it points to future expansion to additional stakeholders and domains.

Abstract

Visual art (VA) recommendation is complex, as it has to consider the interests of users (e.g. museum visitors) and other stakeholders (e.g. museum curators). We study how to effectively account for key stakeholders in VA recommendations while also considering user-centred measures such as novelty, serendipity, and diversity. We propose MOSAIC, a novel multimodal multistakeholder-aware approach using state-of-the-art CLIP and BLIP backbone architectures and two joint optimisation objectives: popularity and representative selection of paintings across different categories. We conducted an offline evaluation using preferences elicited from 213 users followed by a user study with 100 crowdworkers. We found a strong effect of popularity, which was positively perceived by users, and a minimal effect of representativeness. MOSAIC's impact extends beyond visitors, benefiting various art stakeholders. Its user-centric approach has broader applicability, offering advancements for content recommendation across domains that require considering multiple stakeholders.

Paper Structure

This paper contains 34 sections, 8 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Overview of our multimodal approach with CLIP to learn latent semantic representations of paintings.
  • Figure 2: Overview of our multimodal approach with BLIP to learn latent semantic representations of paintings.
  • Figure 3: Sample painting and associated metadata from the National Gallery dataset.
  • Figure 4: Offline evaluation of CLIP-based MOSAIC engines. We report Mean $\pm$ SD of the following pairwise ranking similarity measures: Jaccard index (IoU) and rank-biased overlap (RBO).
  • Figure 5: Offline evaluation of BLIP-based MOSAIC engines. We report Mean $\pm$ SD of the following pairwise ranking similarity measures: Jaccard index (IoU) and rank-biased overlap (RBO).
  • ...and 2 more figures
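
The captions for Figures 4 and 5 refer to two pairwise ranking similarity measures, the Jaccard index (IoU) and rank-biased overlap (RBO). As a rough illustration of what these measure, here is a minimal Python sketch, not the authors' evaluation code. The function names, the persistence parameter `p = 0.9`, and the painting identifiers are assumptions made for this example. The RBO variant shown is the simple truncated form without extrapolation, so two identical lists of length k score 1 - p^k rather than 1.

```python
def jaccard(a, b):
    """Jaccard index (IoU) of two recommendation lists, ignoring rank order."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention: two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def rbo_truncated(a, b, p=0.9):
    """Truncated rank-biased overlap of two ranked lists.

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * A_d, where A_d is the size of
    the overlap between the top-d prefixes divided by d, and k is the length
    of the shorter list. Assumes no item repeats within a list.
    p (hypothetical default 0.9) controls how strongly top ranks dominate.
    """
    k = min(len(a), len(b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, k + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        agreement = len(seen_a & seen_b) / d  # prefix agreement at depth d
        score += p ** (d - 1) * agreement
    return (1 - p) * score


# Hypothetical top-3 lists produced by two recommendation engines.
engine_1 = ["painting_a", "painting_b", "painting_c"]
engine_2 = ["painting_b", "painting_c", "painting_d"]

print("Jaccard:", jaccard(engine_1, engine_2))
print("RBO (truncated):", rbo_truncated(engine_1, engine_2))
```

The Jaccard index only asks whether two engines recommend the same paintings, whereas RBO also rewards agreement near the top of the ranking, which is why the two measures are useful to report side by side.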