
UMAIR-FPS: User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style

Yan Kang, Hao Lin, Mingjian Yang, Shin-Jye Lee

TL;DR

UMAIR-FPS addresses anime illustration recommendation by fusing domain-specific image and text modalities and adapting to individual user preferences. It introduces a dual-output image encoder that captures both semantic content and painting style, and a multi-perspective text encoder fine-tuned on domain knowledge to produce the text embedding $e^\text{TSEM}$. A user-aware multi-modal contribution measurement (UMCM) mechanism reweights the multi-modal features via attention, and DCN-V2 models bounded-degree cross-modal interactions to capture complex user-item influences. Evaluations on a real-world dataset show higher AUC and lower BCE loss than the baselines, with ablations confirming the contribution of each component. The approach advances personalized, multi-modal recommendation for anime illustrations and provides components that can be reused in other domains with strong image-text coupling.

Abstract

The rapid advancement of high-quality AI-based image generation models has produced a deluge of anime illustrations. Recommending illustrations to users from such massive data has become a challenging and popular task. However, existing anime recommendation systems have focused on text features and have yet to integrate image features. In addition, most multi-modal recommendation research is constrained by tightly coupled datasets, limiting its applicability to anime illustrations. We propose the User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style (UMAIR-FPS) to tackle these gaps. In the feature extraction phase, for image features, we are the first to combine painting style features with semantic features to construct a dual-output image encoder that enhances representation. For text features, we obtain text embeddings by fine-tuning Sentence-Transformers on domain knowledge, composing a variety of domain text pairs from three perspectives: multilingual mappings, entity relationships, and term explanations. In the multi-modal fusion phase, we propose a novel user-aware multi-modal contribution measurement mechanism that dynamically weights multi-modal features according to user features at the interaction level, and we employ the DCN-V2 module to model bounded-degree multi-modal crosses effectively. UMAIR-FPS surpasses the state-of-the-art baselines on large real-world datasets, demonstrating substantial performance improvements.
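The fusion phase described in the abstract can be illustrated with a minimal sketch. It rests on assumptions: the abstract does not give the UMCM formulation, so `user_aware_weights` is a hypothetical stand-in that uses plain scaled dot-product attention with the user vector as the query, and all function names and toy dimensions are invented for illustration. Only `cross_layer` follows a known formula, the standard DCN-V2 cross layer $x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$; stacking $L$ such layers bounds the interaction degree at $L + 1$.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def user_aware_weights(user_vec, modality_vecs):
    """ASSUMPTION: stand-in for UMCM, not the paper's formulation.

    Scores each modality embedding against the user vector with
    scaled dot-product attention and normalizes with softmax.
    """
    scale = math.sqrt(len(user_vec))
    return softmax([dot(user_vec, m) / scale for m in modality_vecs])


def reweight_and_concat(user_vec, modality_vecs):
    """Scale each modality by its user-dependent weight, then concatenate."""
    weights = user_aware_weights(user_vec, modality_vecs)
    fused = []
    for w, m in zip(weights, modality_vecs):
        fused.extend(w * x for x in m)
    return weights, fused


def cross_layer(x0, xl, W, b):
    """One DCN-V2 cross layer: x_{l+1} = x0 * (W @ xl + b) + xl (elementwise)."""
    proj = [dot(row, xl) + bi for row, bi in zip(W, b)]
    return [x0_i * p + xl_i for x0_i, p, xl_i in zip(x0, proj, xl)]


if __name__ == "__main__":
    # Toy 2-d embeddings (hypothetical values): semantic, style, text.
    user = [1.0, 0.0]
    modalities = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    weights, fused = reweight_and_concat(user, modalities)
    print("modality weights:", weights)

    # Identity weights and zero bias, just to exercise the cross layer.
    d = len(fused)
    W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    print("after one cross layer:", cross_layer(fused, fused, W, b))
```

In this sketch the weights depend on the user vector, so the same illustration contributes differently for different users, which is the behavior the abstract attributes to user-aware weighting at the interaction level. In a trained model, `W` and `b` would be learned parameters rather than the identity and zero used here.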

Paper Structure

This paper contains 12 sections, 7 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: (a): Same content but different painting styles; (b): Image with labels.
  • Figure 2: The overview architecture of the proposed UMAIR-FPS model.
  • Figure 3: Dual-output image encoder & Multi-perspective text encoder.
  • Figure 4: (a) Interaction-based illustration statistics; (b) interaction-based user statistics.
  • Figure 5: Comparison of $e^\text{SEM}$ between (a) and (b), and of $e^\text{STY}$ between (c) and (d).