VisualLens: Personalization through Task-Agnostic Visual History

Wang Bill Zhu; Deqing Fu; Kai Sun; Yi Lu; Zhaojiang Lin; Seungwhan Moon; Kanika Narang; Mustafa Canim; Yue Liu; Anuj Kumar; Xin Luna Dong

TL;DR

This work addresses the limitations of item-based histories by introducing VisualLens, a framework that uses multimodal large language models (MLLMs) to personalize recommendations from a task-agnostic visual history. It builds a spectrum user profile offline from user photos, captions, and aspect words, and uses a grid-based image representation for efficient runtime reasoning and candidate matching. The approach combines iterative refinement (caption and aspect-word enhancement) with joint multitask training, and is evaluated on two new benchmarks, Google-Review-V and Yelp-V. VisualLens improves Hit@3 by 5-10% over state-of-the-art item-based methods and outperforms GPT-4o by 2-5%, while remaining robust to history length and unseen categories. The work highlights the potential of lifelong visual signals for broad, cross-domain personalization, and discusses modular design, privacy considerations, and richer modalities as directions for future research.

Abstract

Existing recommendation systems either rely on user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. However, item-based histories are not always accessible, and are not generalizable for multimodal recommendation. We hypothesize that a user's visual history -- comprising images from daily life -- can offer rich, task-agnostic insights into their interests and preferences, and thus be leveraged for effective personalization. To this end, we propose VisualLens, a novel framework that leverages multimodal large language models (MLLMs) to enable personalization using task-agnostic visual history. VisualLens extracts, filters, and refines a spectrum user profile from the visual history to support personalized recommendation. We created two new benchmarks, Google-Review-V and Yelp-V, with task-agnostic visual histories, and show that VisualLens improves over state-of-the-art item-based multimodal recommendations by 5-10% on Hit@3, and outperforms GPT-4o by 2-5%. Further analysis shows that VisualLens is robust across varying history lengths and excels at adapting to both longer histories and unseen content categories.
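
The Hit@3 figures above are easier to interpret with the metric written out. The sketch below assumes the common definition of Hit@k (a query counts as a hit if at least one ground-truth item appears among the top k ranked candidates); the paper's exact evaluation protocol may differ. The function names and the toy data are hypothetical and are not taken from the benchmarks.

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """Return 1.0 if any of the top-k ranked candidates is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def mean_hit_at_k(
    examples: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 3
) -> float:
    """Average Hit@k over (ranked candidates, ground-truth set) pairs."""
    scores = [hit_at_k(ranked, relevant, k) for ranked, relevant in examples]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries, each with a ranked candidate list
# and the set of items the user actually chose.
examples = [
    (["cafe", "ramen", "museum", "gym"], {"museum"}),  # hit: rank 3
    (["bar", "park", "bakery", "sushi"], {"sushi"}),   # miss: rank 4
]
score = mean_hit_at_k(examples, k=3)
```

Under this definition, a 5-10% gain in Hit@3 means that for 5 to 10 more queries out of every 100, a ground-truth item reaches the top three, assuming the gain is reported in absolute percentage points.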
