An Efficient Multimodal Learning Framework to Comprehend Consumer Preferences Using BERT and Cross-Attention

Junichiro Niimi

TL;DR

This paper tackles the challenge of predicting consumer preferences from multimodal data by proposing a context-aware framework that integrates textual user content with demographic/tabular context using BERT and cross-attention. The approach avoids simple feature fusion by employing a cross-attention Transformer to dynamically align text and tabular signals, and it demonstrates strong predictive performance on Yelp-derived data across Restaurant, Nightlife, and Café categories. Across extensive experiments, the context-aware model consistently outperforms six baselines, with larger pre-trained models (BERT-Large-Uncased, RoBERTa-Base) further improving accuracy, while overly large models may require more data. Practical implications include improved understanding of consumer heterogeneity and enhanced recommender capabilities, though limitations such as computational demands and fixed token windows are acknowledged and future work is proposed to address them.

Abstract

Today, the acquisition of various behavioral log data has enabled a deeper understanding of customer preferences and future behaviors in the marketing field. In particular, multimodal deep learning has achieved highly accurate predictions by combining multiple types of data. Many of these studies construct multimodal models through feature fusion, which combines the representations extracted from each modality. However, since feature fusion treats information from each modality equally, it is difficult to perform flexible analysis of the kind enabled by the attention mechanism, which has been used extensively in recent years. Therefore, this study proposes a context-aware multimodal deep learning model that combines Bidirectional Encoder Representations from Transformers (BERT) and a cross-attention Transformer, which dynamically changes the attention over deep-contextualized word representations based on background information such as consumer demographic and lifestyle variables. We conduct a comprehensive analysis and demonstrate the effectiveness of our model by comparing it with six reference models in three categories using behavioral logs stored on an online platform. In addition, we present an efficient multimodal learning method by comparing learning efficiency across optimizers and prediction accuracy across the number of tokens in the text data.
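
The contrast the abstract draws between feature fusion and cross-attention can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a single attention head, assumes the context vector acts as the query while the token embeddings act as keys and values, and uses hand-written 2-dimensional vectors in place of real BERT outputs. The names `cross_attention`, `feature_fusion`, `token_embeddings`, `context_a`, and `context_b` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature_fusion(token_vectors, context):
    """Baseline: mean-pool the tokens, then concatenate the context.

    The pooled text part is identical for every context vector.
    """
    dim = len(token_vectors[0])
    n = len(token_vectors)
    pooled = [sum(v[i] for v in token_vectors) / n for i in range(dim)]
    return pooled + context


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Assumption: the query comes from the tabular context and the
    keys/values come from the token embeddings.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    attended = [
        sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)
    ]
    return attended, weights


# Toy stand-ins for BERT token embeddings (3 tokens, 2 dimensions).
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two hypothetical consumer contexts, assumed to be already projected
# into the same 2-dimensional space as the tokens.
context_a = [2.0, 0.0]
context_b = [0.0, 2.0]

attended_a, weights_a = cross_attention(
    context_a, token_embeddings, token_embeddings
)
attended_b, weights_b = cross_attention(
    context_b, token_embeddings, token_embeddings
)
fused_a = feature_fusion(token_embeddings, context_a)
fused_b = feature_fusion(token_embeddings, context_b)
```

In this sketch the text part of `fused_a` and `fused_b` is the same pooled vector, whereas `weights_a` and `weights_b` put different emphasis on the same tokens depending on the context, which is the behavior the abstract describes as dynamically changing attention.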

Paper Structure

This paper contains 19 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: From Attention to Transformer
  • Figure 2: Amount of information in the review text
  • Figure 3: Context-Aware Model
  • Figure 4: Reference models (multimodal)
  • Figure 5: Reference models (monomodal)
  • ...and 1 more figures