Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation
Se-eun Yoon, Hyunsik Jeon, Julian McAuley
TL;DR
The paper tackles image-based conversational recommendation by introducing a Reddit-derived multimodal dataset in which users request book or music recommendations via images. It formalizes two tasks (title generation and multiple-choice selection) and evaluates six foundation models (two vision-language and four language-only) in zero-shot settings, using image-to-text descriptions and a chain-of-imagery (CoI) prompting strategy to harness visual information. A relevance score $S_r(i)=\sum_{c\in C_r} \mathbf{1}_{c\text{ mentions }i} \cdot U(c)$ guides candidate generation and evaluation. Key findings show that large language models given detailed image descriptions outperform vision-language models and that CoI prompting can unlock additional visual capabilities, highlighting both the potential and the present limits of foundation models for mood- and aesthetics-based recommendations.
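As a rough illustration of the relevance score $S_r(i)$ above, the sketch below sums the upvotes of every comment that mentions an item. It assumes that $C_r$ is the set of comments replying to request $r$ and that $U(c)$ is the upvote count of comment $c$; the summary does not spell these out. The `Comment` class, the substring-based `mentions` matcher, and the toy comments are hypothetical and are not taken from the paper's released code or data.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str     # raw comment body
    upvotes: int  # assumed meaning of U(c)


def mentions(comment: Comment, title: str) -> bool:
    # Hypothetical matcher: case-insensitive substring test.
    # The paper's actual mention detection may differ.
    return title.lower() in comment.text.lower()


def relevance_score(comments: list[Comment], title: str) -> int:
    # S_r(i): sum of U(c) over comments c in C_r that mention item i.
    return sum(c.upvotes for c in comments if mentions(c, title))


# Made-up comments for a single request r (toy data, for illustration only).
comments = [
    Comment("Try Piranesi, it has the same quiet dread.", 12),
    Comment("Seconding Piranesi!", 5),
    Comment("Maybe The Night Circus?", 7),
]

scores = {
    title: relevance_score(comments, title)
    for title in ("Piranesi", "The Night Circus")
}
print(scores)
```

Under these assumptions, an item recommended in several well-received comments accumulates a higher score than one mentioned once, so the score can be used to rank candidate titles for a request.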
Abstract
We introduce a multimodal dataset where users express preferences through images. These images encompass a broad spectrum of visual expressions ranging from landscapes to artistic depictions. Users request recommendations for books or music that evoke similar feelings to those captured in the images, and recommendations are endorsed by the community through upvotes. This dataset supports two recommendation tasks: title generation and multiple-choice selection. Our experiments with large foundation models reveal their limitations in these tasks. Particularly, vision-language models show no significant advantage over language-only counterparts that use descriptions, which we hypothesize is due to underutilized visual capabilities. To better harness these abilities, we propose the chain-of-imagery prompting, which results in notable improvements. We release our code and datasets.
