Prompting Large Vision-Language Models for Compositional Reasoning

Timothy Ossowski, Ming Jiang, Junjie Hu

TL;DR

This paper addresses the difficulty that embedding-based vision-language models have with visio-linguistic compositionality, as measured on Winoground. It introduces KeyComp, a tuning-free, three-step framework: keyword detection guides image description generation, and a powerful LLM then performs multi-step reasoning and explanation to solve both the text and image decision tasks. Empirically, KeyComp achieves a new state-of-the-art image score on Winoground and competitive text and group scores. Ablations show that high-quality image descriptions and carefully designed prompts are critical, and an upper-bound analysis indicates that substantial gains are possible by selecting the best among multiple generated descriptions. The work offers a practical alternative to purely embedding-based approaches and points to future work on better VLM content understanding and prompt learning to further improve compositional reasoning in vision-language systems.

Abstract

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address this issue, we make an exploratory step using a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and obtains further improvement of up to 10% accuracy when enhanced with the optimal description.
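
The three-step flow summarized above (keyword detection, keyword-guided image description, LLM reasoning) can be sketched for the caption-selection task roughly as follows. This is a minimal illustration and not the authors' implementation: `detect_keywords` is a toy stand-in that compares word positions, `describe` and `reason` are hypothetical callables wrapping whatever vision-language model and LLM are available, and the prompt wording and the "Answer: A/B" convention are assumptions made for this example.

```python
from typing import Callable, List


def detect_keywords(caption_a: str, caption_b: str) -> List[str]:
    """Toy stand-in for step 1: keep words that sit at different positions."""
    keywords: List[str] = []
    words_a = caption_a.lower().split()
    words_b = caption_b.lower().split()
    for word_a, word_b in zip(words_a, words_b):
        if word_a != word_b and word_a not in keywords:
            keywords.append(word_a)
    return keywords


def build_prompt(description: str, caption_a: str, caption_b: str) -> str:
    """Assumed prompt layout for step 3; the paper's prompts may differ."""
    return (
        f"Image description: {description}\n"
        f"Caption A: {caption_a}\n"
        f"Caption B: {caption_b}\n"
        "Reason step by step about which caption matches the description, "
        "then finish with 'Answer: A' or 'Answer: B'."
    )


def parse_choice(answer: str) -> str:
    """Read the final 'Answer: X' marker from the model's explanation."""
    marker = "ANSWER:"
    text = answer.upper()
    start = text.rfind(marker)
    if start == -1:
        raise ValueError("no final answer found in model output")
    choice = text[start + len(marker):].strip()[:1]
    if choice not in ("A", "B"):
        raise ValueError(f"unexpected choice: {choice!r}")
    return choice


def choose_caption(
    image: str,
    caption_a: str,
    caption_b: str,
    describe: Callable[[str, List[str]], str],  # hypothetical VLM wrapper
    reason: Callable[[str], str],               # hypothetical LLM wrapper
) -> str:
    """Run the three steps and return 'A' or 'B'."""
    keywords = detect_keywords(caption_a, caption_b)   # step 1
    description = describe(image, keywords)            # step 2
    prompt = build_prompt(description, caption_a, caption_b)
    return parse_choice(reason(prompt))                # step 3
```

Passing the models in as callables keeps the sketch tuning-free in the same spirit as the summary: swapping the describer or the reasoner requires no change to the control flow.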

Paper Structure (28 sections, 2 equations, 21 figures, 6 tables)

Figures (21)

  • Figure 1: Illustration of our generative method for the Winoground task. Appendix \ref{sec:question_categories} shows more detailed descriptions and model outputs. Text Score Task: Our method chooses the more appropriate caption given a single image. Image Score Task: Our method chooses the best image given a single caption.
  • Figure 2: A detailed example for the image score task.
  • Figure 3: Fine-grained text score performance across different question categories. We give specific examples from each category in Appendix \ref{sec:question_categories}. Percentages on the x-axis indicate each question type's proportion of the dataset. To ensure representative results, question categories comprising less than 5% of the dataset are excluded.
  • Figure 4: Non-Compositional Question. The swapped words ("Fire" and "Truck") do not necessarily contain the same semantic entities, so compositional reasoning may not be required to solve the question.
  • Figure 5: Ambiguously Correct Question. Note that the correct caption B describes the woman as lying on the couch when she is sitting, but the LLM is still able to pick the ambiguously correct caption.
  • ...and 16 more figures
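
The text score and image score tasks named in the Figure 1 caption can be turned into dataset-level numbers along the following lines. This is a sketch under stated assumptions: it follows the usual Winoground convention that an example counts for the text score only if the right caption is picked for both images, for the image score only if the right image is picked for both captions, and for the group score only if both hold. The `Decisions` record and its field names are hypothetical, and in this generative setting a discrete pick replaces the similarity comparison used by embedding-based models.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Decisions:
    """Model picks for one example; index 0 or 1 (hypothetical layout)."""
    caption_for_image0: int  # correct pick is caption 0
    caption_for_image1: int  # correct pick is caption 1
    image_for_caption0: int  # correct pick is image 0
    image_for_caption1: int  # correct pick is image 1


def winoground_scores(results: List[Decisions]) -> Dict[str, float]:
    """Fraction of examples that are fully correct per task."""
    if not results:
        raise ValueError("need at least one example")
    text = image = group = 0
    for r in results:
        text_ok = r.caption_for_image0 == 0 and r.caption_for_image1 == 1
        image_ok = r.image_for_caption0 == 0 and r.image_for_caption1 == 1
        text += text_ok
        image += image_ok
        group += text_ok and image_ok
    n = len(results)
    return {"text": text / n, "image": image / n, "group": group / n}
```

Because each score requires both picks in an example to be right, a model that always prefers one caption or one image gets no credit.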