Enhancing Cross-Modal Contextual Congruence for Crowdfunding Success using Knowledge-infused Learning

Trilok Padhi, Ugur Kursuncu, Yaman Kumar, Valerie L. Shalin, Lane Peterson Fronczek

TL;DR

This work tackles the challenge of cross-modal contextual congruence in multimodal crowdfunding campaigns by integrating external commonsense knowledge from ConceptNet into compact Visual Language Models. It introduces a neurosymbolic framework that retrieves KG concepts, encodes them as KG embeddings, and fuses them with image-text representations via a multi-head cross-attention fusion module to predict campaign success. Empirical results on Kickstarter data show that knowledge-infused representations improve cross-modal alignment and predictive performance (best AUC ≈ 0.94, F1 ≈ 0.92) over baselines like MDL-TIM and standard MMBT variants, while reducing hallucinations in generated captions. The approach highlights the practical value of knowledge-grounded multimodal representations for online marketing and broader cross-modal reasoning tasks, with attention to potential noise and fairness considerations in KG retrieval.
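
The fusion step described above can be illustrated with a minimal sketch of multi-head cross-attention, in which each image-text token attends over the retrieved KG concept embeddings. Several assumptions are made here: the function names, the toy vectors, the choice of image-text tokens as queries with KG embeddings as keys and values, and the omission of learned projection matrices (identity projections are used) are all illustrative. None of these are taken from the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def multi_head_cross_attention(queries, kg_embeddings, num_heads):
    """Each query token attends over the KG embeddings, one slice per head.

    Assumption: learned query/key/value projections are omitted, so each
    head simply works on its own slice of the embedding dimensions.
    """
    dim = len(queries[0])
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = dim // num_heads
    fused = []
    for query in queries:
        output = []
        for head in range(num_heads):
            lo, hi = head * head_dim, (head + 1) * head_dim
            kg_slice = [emb[lo:hi] for emb in kg_embeddings]
            # Keys and values are both the KG slice in this sketch.
            output.extend(attend(query[lo:hi], kg_slice, kg_slice))
        fused.append(output)
    return fused


# Toy inputs (hypothetical values, not from the paper's data).
multimodal_tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8]]
kg_concepts = [
    [0.9, 0.1, 0.4, 0.6],
    [0.1, 0.9, 0.3, 0.7],
    [0.5, 0.5, 0.5, 0.5],
]
fused = multi_head_cross_attention(multimodal_tokens, kg_concepts, num_heads=2)
```

In this sketch every fused token is a weighted average of the KG embeddings, so the output keeps the shape of the input tokens. A full model would typically add learned projections, a residual connection, and a classifier head on top.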

Abstract

The digital landscape continually evolves with multimodality, enriching the online experience for users. Creators and marketers aim to weave subtle contextual cues from various modalities into congruent content to engage users with a harmonious message. This interplay of multimodal cues is often a crucial factor in attracting users' attention. However, this richness of multimodality presents a challenge to computational modeling, as the semantic contextual cues spanning modalities need to be unified to capture the true holistic meaning of the multimodal content. This contextual meaning is critical in attracting user engagement, as it conveys the intended message of the brand or the organization. In this work, we incorporate external commonsense knowledge from knowledge graphs to enhance the representation of multimodal data using compact Visual Language Models (VLMs) and predict the success of multimodal crowdfunding campaigns. Our results show that external commonsense knowledge bridges the semantic gap between text and image modalities, and that the enhanced knowledge-infused representations improve the predictive performance of models for campaign success over baselines without knowledge. Our findings highlight the significance of contextual congruence in online multimodal content for engaging and successful crowdfunding campaigns.

Paper Structure

This paper contains 24 sections, 8 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: An example pair of image and actual caption text from our crowdfunding dataset. The caption generated by the BLIP model and the actual human-written caption of the image are shown below. BLIP caption: "Two women smiling with a hand gesture of rock and roll." Actual human caption: "Smash the glass ceiling. Destroy the patriarchy. Save the record store."
  • Figure 2: t-SNE visualization of text and generated image caption embeddings as two clusters. The red dots are centroids. The two clusters become denser and the distance between them decreases when we include external knowledge.
  • Figure 3: Density plot of the difference between the cosine similarities of the image and text representations with and without knowledge. Including knowledge in the input brings these modalities closer by $9.9\%$.
  • Figure 4: Our approach consists of three main components: (i) multimodal learning, (ii) knowledge retrieval and representation, and (iii) knowledge fusion layer. The retrieval component identifies the most relevant concepts from ConceptNet and generates their KG embeddings. The knowledge fusion layer fuses the multimodal representations with the knowledge embeddings, followed by Softmax.
  • Figure 5: The yellow box contains the model prediction with knowledge and two baseline models. The green box shows the actual caption, BLIP caption, and retrieved concepts for each modality.
  • ...and 8 more figures
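
The comparison behind Figure 3 rests on cosine similarity between image and text representations, measured with and without knowledge. As a minimal sketch, the snippet below computes cosine similarity and the relative change between two similarity scores. The function names and the toy vectors are hypothetical and only illustrate the arithmetic; they do not reproduce the paper's embeddings or its reported $9.9\%$ figure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Toy embeddings (made-up values for illustration only).
text_plain = [1.0, 0.0, 0.0]
image_plain = [0.6, 0.8, 0.0]
text_with_kg = [1.0, 0.2, 0.0]
image_with_kg = [0.8, 0.6, 0.0]

sim_without = cosine_similarity(text_plain, image_plain)
sim_with = cosine_similarity(text_with_kg, image_with_kg)
gain = relative_change(sim_without, sim_with)
```

A positive `gain` would indicate that the two modalities sit closer together once knowledge is included, which is the kind of shift the density plot summarizes over the whole dataset.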