Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation

Se-eun Yoon, Hyunsik Jeon, Julian McAuley

TL;DR

The paper tackles image-based conversational recommendation by introducing a Reddit-derived multimodal dataset in which users request book or music recommendations via images. It formalizes two tasks, title generation and multiple-choice selection, and evaluates six foundation models (two vision-language and four language-only) in zero-shot settings, using image-to-text descriptions and a chain-of-imagery (CoI) prompting strategy to harness visual information. A relevance score $S_r(i)=\sum_{c\in C_r} \mathbf{1}_{c\text{ mentions }i} \cdot U(c)$ guides candidate generation and evaluation. Key findings show that large language models with detailed descriptions outperform vision-language models and that CoI prompting can unlock additional visual capabilities, highlighting both the potential and the limits of current models for mood- and aesthetics-based recommendations.
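
The relevance score above can be illustrated with a minimal sketch. It assumes that $C_r$ is the set of comments replying to request $r$ and that $U(c)$ is the upvote count of comment $c$; this reading follows from the formula and the abstract's mention of upvotes, but is not spelled out in this summary. The function name `relevance_scores`, the substring-based mention check, and the example comments are all hypothetical stand-ins, not the authors' implementation.

```python
from collections import defaultdict


def relevance_scores(comments, items):
    """Return {item: S_r(item)} for a single request r.

    comments: list of (text, upvotes) pairs, standing in for C_r and U(c).
    items: candidate titles to look for in the comments.
    Assumption: "c mentions i" is approximated by a case-insensitive
    substring match; real mention detection may differ.
    """
    scores = defaultdict(int)
    for text, upvotes in comments:
        lowered = text.lower()
        for item in items:
            # Indicator 1[c mentions i] times U(c).
            if item.lower() in lowered:
                scores[item] += upvotes
    return dict(scores)


# Invented example data, for illustration only.
comments = [
    ("Try Piranesi by Susanna Clarke", 12),
    ("Piranesi, or maybe The Night Circus", 5),
    ("Seconding The Night Circus!", 3),
]
items = ["Piranesi", "The Night Circus", "Dune"]

scores = relevance_scores(comments, items)
# Rank mentioned items by score, highest first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Items that no comment mentions simply receive no entry, which corresponds to a score of zero under the formula.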

Abstract

We introduce a multimodal dataset where users express preferences through images. These images encompass a broad spectrum of visual expressions ranging from landscapes to artistic depictions. Users request recommendations for books or music that evoke similar feelings to those captured in the images, and recommendations are endorsed by the community through upvotes. This dataset supports two recommendation tasks: title generation and multiple-choice selection. Our experiments with large foundation models reveal their limitations in these tasks. Particularly, vision-language models show no significant advantage over language-only counterparts that use descriptions, which we hypothesize is due to underutilized visual capabilities. To better harness these abilities, we propose chain-of-imagery prompting, which results in notable improvements. We release our code and datasets.

Paper Structure (8 sections, 3 figures, 4 tables)

This paper contains 8 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Users seeking recommendations through imagery. Each request uses one or more images that convey the essence of the music (left) or book (right) the user is looking for.
  • Figure 2: We construct title generation (above) and multiple-choice selection (below) tasks, based on our dataset containing requests and recommendations.
  • Figure 3: Larger models (left) benefit from detailed descriptions, but smaller models (right) do not. Results for books are shown; results for music are similar.