InteraRec: Screenshot Based Recommendations Using Multimodal Large Language Models

Saketh Reddy Karra; Theja Tulabandhula

TL;DR

InteraRec is a screenshot-based recommender that uses multimodal large language models to extract user preferences from visual browsing data, addressing the interpretability and real-time personalization limitations of weblog-based systems. The framework uses a three-stage pipeline of screenshot generation, behavioral summarization, and response generation, with an optimization-backed mechanism to produce personalized recommendations and a session-based re-ranking variant to exploit context. A new Amazon screenshot dataset is created to validate the approach, and experiments show improvements in ranking metrics over several baselines, along with insights on data volume, input modality, and session length. Overall, the work demonstrates the practical potential of visual data and LLM-driven reasoning for real-time, personalized recommendations in e-commerce.

Abstract

Weblogs, which record user activities on a website, offer valuable insights into user preferences, behavior, and interests. Numerous recommendation algorithms, employing strategies such as collaborative filtering, content-based filtering, and hybrid methods, use the data mined from these weblogs to provide personalized recommendations to users. Despite the abundance of information in these weblogs, identifying and extracting pertinent information and key features from them requires extensive engineering effort. The intricate nature of the data also makes it hard to interpret, especially for non-experts. In this study, we introduce an interactive recommendation framework, InteraRec, which diverges from conventional approaches that depend exclusively on weblogs to generate recommendations. The InteraRec framework captures high-frequency screenshots of web pages as users navigate through a website. Leveraging state-of-the-art multimodal large language models (MLLMs), it extracts insights into user preferences from these screenshots by generating a textual summary based on predefined keywords. Subsequently, an LLM-integrated optimization setup uses this summary to generate tailored recommendations. Furthermore, we explore the integration of session-based recommendation systems into the InteraRec framework, aiming to enhance its overall performance. Finally, we curate a new dataset of screenshots from product web pages on the Amazon website to validate the InteraRec framework. Detailed experiments demonstrate the efficacy of the InteraRec framework in delivering valuable and personalized recommendations tailored to individual user preferences.
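
The flow described in the abstract (screenshots, then a keyword-based summary, then recommendations) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the keyword list, the `Item` type, and the `describe` callable that stands in for a multimodal model call are hypothetical, and the final step uses a simple keyword-overlap score rather than the LLM-integrated optimization setup the abstract refers to.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical keyword set; the predefined keywords used by InteraRec
# are not listed in this text.
KEYWORDS = ("product", "category", "brand", "price")


@dataclass
class Item:
    name: str
    attributes: Dict[str, str]


def summarize_screenshots(
    screenshots: Sequence[bytes],
    describe: Callable[[bytes, Sequence[str]], Dict[str, str]],
) -> Dict[str, List[str]]:
    """Merge per-screenshot descriptions into one keyword-indexed summary.

    `describe` is a stand-in for a multimodal model call: it takes one
    screenshot plus the keywords and returns a value per keyword it found.
    """
    summary: Dict[str, List[str]] = {k: [] for k in KEYWORDS}
    for shot in screenshots:
        described = describe(shot, KEYWORDS)
        for key in KEYWORDS:
            value = described.get(key)
            # Keep each observed value once, in first-seen order.
            if value and value not in summary[key]:
                summary[key].append(value)
    return summary


def recommend(
    summary: Dict[str, List[str]],
    catalog: Sequence[Item],
    top_k: int = 3,
) -> List[Item]:
    """Rank catalog items by how many summary keywords they match.

    This overlap score is a placeholder for the recommendation step,
    not the optimization setup described in the paper.
    """
    def score(item: Item) -> int:
        return sum(
            1
            for key, values in summary.items()
            if item.attributes.get(key) in values
        )

    # sorted() is stable, so ties keep their catalog order.
    return sorted(catalog, key=score, reverse=True)[:top_k]
```

In practice `describe` would wrap whichever multimodal model is available; keeping it as an injected callable lets the summarization and ranking logic be exercised without any model access.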
