Large Language Models can Share Images, Too!

Young-Jun Lee, Dokyong Lee, Joo Won Sung, Jonghwan Hyeon, Ho-Jin Choi

TL;DR

This work investigates whether Large Language Models (LLMs) can share images in dialogue under zero-shot prompting. It introduces DribeR, a three-stage, gradient-free framework (Decide, Describe, Retrieve), and PhotoChat++, an extension of PhotoChat with richer annotations for evaluating image-sharing cues, intents, and retrieval relevance. Across ChatGPT, GPT-4, and open-source LLMs, DribeR unlocks image-sharing capabilities and reveals emergent zero-shot behavior. Chain-of-thought (CoT) reasoning offers limited or even negative benefit in this setting, and few-shot prompts can introduce confusion. The approach proves useful in two real-world scenarios, human-bot interaction and dataset augmentation. Code and data are publicly available, enabling broader study of image sharing in multi-modal dialogue.

Abstract

This paper explores the image-sharing capability of Large Language Models (LLMs), such as GPT-4 and LLaMA 2, in a zero-shot setting. To facilitate a comprehensive evaluation of LLMs, we introduce the PhotoChat++ dataset, which includes enriched annotations (i.e., intent, triggering sentence, image description, and salient information). Furthermore, we present the gradient-free and extensible Decide, Describe, and Retrieve (DribeR) framework. With extensive experiments, we unlock the image-sharing capability of DribeR equipped with LLMs in zero-shot prompting, with ChatGPT achieving the best performance. Our findings also reveal the emergent image-sharing ability in LLMs under zero-shot conditions, validating the effectiveness of DribeR. We use this framework to demonstrate its practicality and effectiveness in two real-world scenarios: (1) human-bot interaction and (2) dataset augmentation. To the best of our knowledge, this is the first study to assess the image-sharing ability of various LLMs in a zero-shot setting. We make our source code and dataset publicly available at https://github.com/passing2961/DribeR.
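The three stages named above (Decide, Describe, Retrieve) can be pictured as a small gradient-free pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the prompt wording, the `LLM` callable type, the `Image` record, and the function names are all hypothetical, and the word-overlap retriever is only a toy stand-in for whatever retrieval model a real system would plug in.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Image:
    image_id: str
    caption: str  # assumed text caption used by the toy retriever below


def decide(llm: LLM, dialogue: Sequence[str]) -> bool:
    """Stage 1: ask the LLM whether an image should be shared at the next turn."""
    prompt = (
        "Dialogue:\n" + "\n".join(dialogue)
        + "\nShould an image be shared at the next turn? Answer yes or no."
    )
    return llm(prompt).strip().lower().startswith("yes")


def describe(llm: LLM, dialogue: Sequence[str]) -> str:
    """Stage 2: ask the LLM to describe the image that fits the dialogue."""
    prompt = (
        "Dialogue:\n" + "\n".join(dialogue)
        + "\nDescribe, in one sentence, the image that would be shared next."
    )
    return llm(prompt).strip()


def retrieve(description: str, pool: Sequence[Image]) -> Image:
    """Stage 3: pick the best-matching image (toy Jaccard word overlap, an assumption)."""
    query = set(description.lower().split())

    def score(image: Image) -> float:
        words = set(image.caption.lower().split())
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    return max(pool, key=score)


def share_image(llm: LLM, dialogue: Sequence[str], pool: Sequence[Image]) -> Optional[Image]:
    """Run Decide, then Describe, then Retrieve; return None when no image is shared."""
    if not pool or not decide(llm, dialogue):
        return None
    return retrieve(describe(llm, dialogue), pool)
```

Because every stage is a prompt or a lookup, no model weights are updated, which is what "gradient-free" refers to here. Swapping in a different LLM or retriever only requires replacing the `llm` callable or the `retrieve` function.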

Paper Structure (78 sections, 15 figures, 8 tables)

Figures (15)

  • Figure 1: An illustration of the two-stage internal system behind human image-sharing behavior.
  • Figure 2: Analysis of PhotoChat++. (Left) the distribution of triggering sentence and salient information. (Right) the intent distribution.
  • Figure 3: An illustration of our proposed framework: Decide, Describe, and Retrieve (DribeR).
  • Figure 4: Results for Intent$_{\textsc{[Choice]}}$ and Sentence$_{\textsc{[Dist.]}}$ are presented, with All signifying instances where DribeR correctly identifies Decision$_{\textsc{[Y/N]}}$, Intent$_{\textsc{[Choice]}}$, and Sentence$_{\textsc{[Dist.]}}$ simultaneously.
  • Figure 5: Zero-shot Hits@10 (%) across multiple dialogue rounds, comparing the ChatIR system (Levy et al., 2023) with DribeR. A dialogue of 0 rounds means that only the image description is provided to the model.
  • ...and 10 more figures