VisualLens: Personalization through Task-Agnostic Visual History

Wang Bill Zhu, Deqing Fu, Kai Sun, Yi Lu, Zhaojiang Lin, Seungwhan Moon, Kanika Narang, Mustafa Canim, Yue Liu, Anuj Kumar, Xin Luna Dong

TL;DR

This work addresses the limitations of relying on item-based histories by introducing VisualLens, a framework that learns personalization from a task-agnostic visual history using multimodal large language models. It constructs an offline spectrum user profile from user photos, captions, and aspect words, and uses a grid-based representation for efficient runtime reasoning and candidate matching. The approach leverages iterative refinement (caption/aspect-word enhancement) and joint multitask training, supported by two new benchmarks, Google Review-V and Yelp-V, showing consistent improvements in $Hit@3$ (5-10% over state-of-the-art item-based methods) and competitive performance against GPT-4o, with robustness to history length and unseen categories. The work highlights the potential of lifelong visual signals for broad, cross-domain personalization while outlining modular design, privacy considerations, and avenues for richer modalities in future research.

Abstract

Existing recommendation systems either rely on user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. However, item-based histories are not always accessible, and are not generalizable for multimodal recommendation. We hypothesize that a user's visual history -- comprising images from daily life -- can offer rich, task-agnostic insights into their interests and preferences, and thus be leveraged for effective personalization. To this end, we propose VisualLens, a novel framework that leverages multimodal large language models (MLLMs) to enable personalization using task-agnostic visual history. VisualLens extracts, filters, and refines a spectrum user profile from the visual history to support personalized recommendation. We created two new benchmarks, Google-Review-V and Yelp-V, with task-agnostic visual histories, and show that VisualLens improves over state-of-the-art item-based multimodal recommendations by 5-10% on Hit@3, and outperforms GPT-4o by 2-5%. Further analysis shows that VisualLens is robust across varying history lengths and excels at adapting to both longer histories and unseen content categories.
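The abstract reports results as Hit@3, and the figure captions use MRR. The sketch below shows how these ranking metrics are commonly computed. It assumes the standard textbook definitions, which may differ in detail from the paper's evaluation protocol. The function names and toy data are hypothetical and are not taken from the VisualLens code.

```python
from typing import Hashable, Sequence, Set


def hit_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 3) -> float:
    """Return 1.0 if any of the top-k ranked candidates is relevant, else 0.0."""
    return 1.0 if any(c in relevant for c in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant candidate, or 0.0 if none is ranked."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: Sequence[float]) -> float:
    """Average a per-query metric over all queries (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0


# Toy data (assumed, for illustration only): each query has a model ranking
# over candidates and the set of candidates the user actually preferred.
queries = [
    (["cafe", "museum", "park", "gym"], {"park"}),   # first hit at rank 3
    (["gym", "cafe", "park", "museum"], {"museum"}), # first hit at rank 4
]

hit3 = mean([hit_at_k(ranked, rel, k=3) for ranked, rel in queries])
mrr = mean([reciprocal_rank(ranked, rel) for ranked, rel in queries])
```

Under these definitions, Hit@3 only asks whether a preferred candidate appears in the top three, while MRR also rewards placing it closer to the top.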

Paper Structure

This paper contains 39 sections, 2 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: VisualLens leverages a user's task-agnostic visual history to provide personalized recommendations. Our method outperforms GPT-4o by 1.6%$\sim$4.6% on Hit@3.
  • Figure 2: VisualLens inference pipeline: the offline process augments images with captions and aspect words to generate a spectrum user profile; the runtime recommendation process retrieves relevant images, generates a query-specific user profile accordingly, and then predicts candidate preferences.
  • Figure 3: (a) MRR distribution over the number of candidates, (b) MRR distribution over the number of images. Both are on the User ID test set. We find that (1) MRR converges when the number of candidates exceeds 50; (2) MRR increases with the number of images and flattens after reaching $\sim$100 images.
  • Figure 4: (a) MRR distribution over categories on Google Review-V, (b) MRR distribution over categories on Yelp-V. We find that (1) the performance per category is loosely correlated with the amount of training data; (2) performance is better on categories that are more general and less ambiguous.
  • Figure 5: The Google Review-Vision (Google Review-V) training data consists of 66 categories.
  • ...and 6 more figures