Prompting Large Vision-Language Models for Compositional Reasoning

Timothy Ossowski; Ming Jiang; Junjie Hu

TL;DR

This paper tackles the difficulty that embedding-based vision-language models have with visio-linguistic compositionality, as measured on Winoground. It introduces KeyComp, a tuning-free, three-step framework: keyword detection guides image description generation, and a powerful LLM then performs multi-step reasoning and explanation to make the text and image decisions. Empirically, KeyComp achieves a new state-of-the-art image score on Winoground and competitive text and group scores. Ablations show the critical role of high-quality image descriptions and carefully designed prompts, and an upper-bound analysis indicates that substantial gains are possible by selecting the best among multiple generated descriptions. The work offers a practical alternative to purely embedding-based approaches and points to future directions in improved VLM content understanding and prompt learning to further enhance compositional reasoning in vision-language systems.
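The three steps can be pictured as a small pipeline. The sketch below is only an illustration of the control flow for the text decision, not the authors' implementation: every function name, the toy description table, and the substring-matching "reasoner" are hypothetical stand-ins for the keyword detector, the VLM, and the LLM. It also assumes that keywords are drawn from the candidate captions.

```python
from typing import Callable, Dict, List

# Hypothetical toy data: an image id mapped to a canned description.
TOY_DESCRIPTIONS: Dict[str, str] = {
    "img_0": "In the photo, an old person kisses a young person on the cheek.",
}


def toy_detect_keywords(captions: List[str]) -> List[str]:
    """Stand-in for Step 1: keep unique words longer than 3 characters."""
    keywords: List[str] = []
    for caption in captions:
        for word in caption.lower().split():
            word = word.strip(".,")
            if len(word) > 3 and word not in keywords:
                keywords.append(word)
    return keywords


def toy_describe_image(image_id: str, keywords: List[str]) -> str:
    """Stand-in for Step 2: a real system would prompt a VLM with the keywords."""
    return TOY_DESCRIPTIONS[image_id]


def toy_reason(description: str, captions: List[str]) -> int:
    """Stand-in for Step 3: a real system would prompt an LLM to reason step by step."""
    for index, caption in enumerate(captions):
        if caption.lower() in description.lower():
            return index
    return 0  # arbitrary fallback, only for this toy


def choose_caption(
    image_id: str,
    captions: List[str],
    detect: Callable[[List[str]], List[str]] = toy_detect_keywords,
    describe: Callable[[str, List[str]], str] = toy_describe_image,
    reason: Callable[[str, List[str]], int] = toy_reason,
) -> int:
    """Run the three steps in order and return the index of the chosen caption."""
    keywords = detect(captions)                  # Step 1: keyword detection
    description = describe(image_id, keywords)   # Step 2: keyword-guided description
    return reason(description, captions)         # Step 3: reasoning over text only


captions = [
    "an old person kisses a young person",
    "a young person kisses an old person",
]
choice = choose_caption("img_0", captions)
```

Passing the three stages as callables keeps the sketch tuning-free in spirit: swapping in a real VLM or LLM would change only the arguments, not the pipeline.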

Abstract

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address this issue, we make an exploratory step using a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and obtains further improvement of up to 10% accuracy when enhanced with the optimal description.

Paper Structure (28 sections, 2 equations, 21 figures, 6 tables)

Introduction
Method
Problem Definition
Step 1: Keyword Detection
Step 2: Keyword-guided Description
Step 3: LLM Reasoning & Explanation
Experimental Settings
Dataset & Evaluation
Methods in Comparison
Model Selection & Hyperparameters
Results and Analysis
Overall Performance
Image Description Quality Matters.
Error Analysis and Findings
Prompt and Model Ablations
...and 13 more sections

Figures (21)

Figure 1: Illustration of our generative method for the Winoground task. The appendix on question categories shows more detailed descriptions and model outputs. Text Score Task: Our method chooses the more appropriate caption given a single image. Image Score Task: Our method chooses the best image given a single caption.
Figure 2: A detailed example for the image score task.
Figure 3: Fine-grained text score performance across different question categories. We give specific examples from each category in the appendix on question categories. Percentages on the x-axis indicate each question type's proportion of the dataset. To ensure representative results, question categories comprising less than 5% of the dataset are excluded.
Figure 4: Non-Compositional Question. The swapped words ("Fire" and "Truck") do not necessarily contain the same semantic entities, so compositional reasoning may not be required to solve the question.
Figure 5: Ambiguously Correct Question. Note that the correct caption B describes the woman as lying on the couch when she is sitting, but the LLM is still able to pick the ambiguously correct caption.
...and 16 more figures
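The text and image score tasks described in the Figure 1 caption can be turned into per-example scores as sketched below. This is a minimal sketch that assumes the standard Winoground convention: each example has two captions and two images, caption 0 belongs with image 0 and caption 1 with image 1, and a score counts only if both decisions in it are right. The function name and the tuple encoding of the choices are hypothetical.

```python
from typing import Dict, Tuple


def winoground_scores(
    caption_for_image: Tuple[int, int],
    image_for_caption: Tuple[int, int],
) -> Dict[str, bool]:
    """Score one example from the model's discrete choices.

    caption_for_image[i] is the caption index chosen for image i (text task).
    image_for_caption[j] is the image index chosen for caption j (image task).
    Assumed ground truth: caption 0 matches image 0, caption 1 matches image 1.
    """
    text_ok = caption_for_image == (0, 1)
    image_ok = image_for_caption == (0, 1)
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


# Both captions chosen correctly, but image 0 picked for both captions.
example = winoground_scores((0, 1), (0, 0))
```

Averaging each field over all examples would give dataset-level text, image, and group scores under this assumption.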
