Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study

Peilin Zhou, Meng Cao, You-Liang Huang, Qichen Ye, Peiyan Zhang, Junling Liu, Yueqi Xie, Yining Hua, Jaeboum Kim

TL;DR

This study investigates GPT-4V(ision) as a multimodal recommender by constructing qualitative test cases across culture, art, entertainment, and retail to probe zero-shot recommendations, explanations, and multi-image integration. It demonstrates robust image-text understanding and adaptability without domain-specific training, while highlighting limitations such as prompt sensitivity and limited diversity in some scenarios. The authors discuss evaluation constraints (no quantitative benchmarks due to API access) and ethical considerations, proposing future work on benchmarking, interactive prompts, and safety frameworks. Overall, the work suggests significant potential for next-generation multimodal generative recommender models and emphasizes the need for systematic benchmarks and robust, user-feedback-driven development to improve diversity, interactivity, and reliability in real-world deployments.

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive performance across various vision and language tasks, yet their potential applications in recommendation tasks with visual assistance remain unexplored. To bridge this gap, we present a preliminary case study investigating the recommendation capabilities of GPT-4V(ision), a recently released LMM by OpenAI. We construct a series of qualitative test samples spanning multiple domains and use them to assess the quality of GPT-4V's responses in recommendation scenarios. Evaluation results on these test samples show that GPT-4V has remarkable zero-shot recommendation abilities across diverse domains, thanks to its robust visual-text comprehension capabilities and extensive general knowledge. However, we have also identified some limitations in using GPT-4V for recommendations, including a tendency to provide similar responses when given similar inputs. This report concludes with an in-depth discussion of the challenges and research opportunities associated with using GPT-4V in recommendation scenarios. Our objective is to explore the potential of extending LMMs from vision and language tasks to recommendation tasks. We hope to inspire further research into next-generation multimodal generative recommendation models, which can enhance user experiences by offering greater diversity and interactivity. All images and prompts used in this report will be accessible at https://github.com/PALIN2018/Evaluate_GPT-4V_Rec.
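
The limitation noted in the abstract, similar responses for similar inputs, is reported qualitatively; the report itself presents no quantitative benchmark. As a purely illustrative sketch of how such similarity could be quantified in follow-up work, the snippet below computes the mean pairwise Jaccard overlap between recommendation lists. The function names and the placeholder item labels are assumptions made for this example and do not come from the report.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two recommendation lists, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Two empty lists are treated as identical by convention.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def mean_pairwise_overlap(rec_lists):
    """Average Jaccard overlap over all pairs of recommendation lists.

    Values near 1.0 mean the lists are nearly identical (low diversity);
    values near 0.0 mean they share few items.
    """
    pairs = list(combinations(rec_lists, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical placeholder labels standing in for items recommended
# in response to three similar prompts.
responses = [
    ["item_a", "item_b", "item_c"],
    ["item_b", "item_c", "item_d"],
    ["item_a", "item_b", "item_c"],
]
overlap = mean_pairwise_overlap(responses)
```

A higher score across responses to near-identical prompts would be consistent with the limited-diversity observation, though any threshold for "too similar" would have to be chosen by the evaluator.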

Paper Structure (28 sections, 29 figures)

Figures (29)

  • Figure 1: Culture&Art-Case1. GPT-4V is asked to recommend related art pieces to users based on a painting from a particular school. It successfully identifies the schools the painting belongs to (i.e., Suprematism and Constructivism) and its historical period. It also recommends closely related artists and their works. Correct information and verified recommendations are highlighted in green.
  • Figure 2: Culture&Art-Case2. GPT-4V is asked to recommend art pieces to users based on a poster with a particular historical background and aesthetic style. GPT-4V successfully identifies the poster's specific historical background and recommends closely related art pieces. Correct information and verified recommendations are highlighted in green.
  • Figure 3: Culture&Art-Case3. GPT-4V is asked to recommend dramas to users based on a clip from a particular drama. GPT-4V successfully identifies the drama the clip belongs to and recommends related shows with a similar theme. Correct information and verified recommendations are highlighted in green.
  • Figure 4: Culture&Art-Case4. GPT-4V is first asked to identify the story and artist that the illustration belongs to and then to offer recommendations based on its previous answers. The interaction steps are numbered in order, and correct information and verified recommendations are highlighted in green. No.1: GPT-4V identifies the artist but fails to identify the story. No.2: Given the story context, GPT-4V successfully recommends other illustrations from the same story or from other stories. No.3: GPT-4V successfully identifies the figure but fails to identify the scene the illustration belongs to and its story background (highlighted in yellow). No.4: GPT-4V successfully identifies the theme and recommends closely related operas.
  • Figure 5: Culture&Art-Case5. GPT-4V is asked first to identify the author of an illustration and then to offer recommendations based on its understanding of the illustration. GPT-4V successfully identifies the illustration's author and offers some recommendations based on the context of the illustration. Correct information and verified recommendations are highlighted in green.
  • ...and 24 more figures