Prompting Large Vision-Language Models for Compositional Reasoning
Timothy Ossowski, Ming Jiang, Junjie Hu
TL;DR
This paper tackles the problem that embedding-based vision-language models struggle with visio-linguistic compositionality, as measured on Winoground. It introduces KeyComp, a tuning-free, three-step framework: keyword detection guides image description generation, and a powerful LLM then performs multi-step reasoning with explanation to decide between the candidate captions or images. Empirically, KeyComp achieves a new state-of-the-art image score on Winoground and competitive text and group scores. Ablations show the critical role of high-quality image descriptions and carefully designed prompts, and an upper-bound analysis indicates that substantial further gains are possible by selecting the best among multiple generated descriptions. The work offers a practical alternative to purely embedding-based approaches and points to future work on improved VLM content understanding and prompt learning to further enhance compositional reasoning in vision-language systems.
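The three-step flow summarized above can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the function names, the prompt wording, and the "Answer: <letter>" convention are all assumptions, and the three model calls are injected as plain callables so that no particular API is implied.

```python
from typing import Callable, List, Sequence

# Hypothetical signatures for the three model-backed steps (assumptions).
KeywordFn = Callable[[Sequence[str]], List[str]]   # captions -> keywords
DescribeFn = Callable[[str, List[str]], str]       # (image id, keywords) -> description
ReasonFn = Callable[[str], str]                    # prompt -> LLM answer text


def build_reasoning_prompt(description: str, captions: Sequence[str]) -> str:
    """Assemble a step-by-step reasoning prompt (wording is illustrative)."""
    options = "\n".join(
        f"({chr(ord('A') + i)}) {caption}" for i, caption in enumerate(captions)
    )
    return (
        f"Image description: {description}\n"
        f"Candidate captions:\n{options}\n"
        "Think step by step, explain your reasoning, "
        "then finish with 'Answer: <letter>'."
    )


def choose_caption(
    image: str,
    captions: Sequence[str],
    detect_keywords: KeywordFn,
    describe_image: DescribeFn,
    reason: ReasonFn,
) -> int:
    """Return the index of the caption the reasoning step selects."""
    keywords = detect_keywords(captions)              # step 1: keyword detection
    description = describe_image(image, keywords)     # step 2: keyword-guided description
    answer = reason(build_reasoning_prompt(description, captions))  # step 3: reasoning
    # Parse the final letter after the last "Answer:" marker.
    letter = answer.rsplit("Answer:", 1)[-1].strip()[:1].upper()
    if len(letter) != 1:
        raise ValueError(f"no answer letter found in: {answer!r}")
    index = ord(letter) - ord("A")
    if not 0 <= index < len(captions):
        raise ValueError(f"answer letter out of range: {letter!r}")
    return index
```

In use, each callable would wrap a real model; with simple stand-ins such as `reason=lambda prompt: "... Answer: B"`, `choose_caption` returns index 1. Keeping the steps separate makes it easy to swap in a different description generator, which the ablations suggest is the most sensitive component.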
Abstract
Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address this issue, we take an exploratory step using a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and obtains a further improvement of up to 10% accuracy when enhanced with the optimal description.
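For context on the evaluation, Winoground examples pair two captions with two images and report text, image, and group scores. The sketch below shows how these are commonly computed from a 2x2 matrix of caption-image match scores. The function names and the matrix layout are assumptions made for illustration, and the match scores may come from any scorer (embedding similarity or otherwise).

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # sim[c][i]: score of caption c against image i


def winoground_scores(sim: Matrix) -> Tuple[bool, bool, bool]:
    """Return (text, image, group) correctness for one example."""
    # Text score: for each image, its own caption must outscore the other caption.
    text_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Image score: for each caption, its own image must outscore the other image.
    image_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Group score: both conditions must hold.
    return text_ok, image_ok, text_ok and image_ok


def accuracy(examples: List[Matrix]) -> Tuple[float, float, float]:
    """Average the three scores over a list of examples."""
    results = [winoground_scores(sim) for sim in examples]
    n = len(results)
    return tuple(sum(r[k] for r in results) / n for k in range(3))
```

The image condition compares scores across images for a fixed caption, which is why it is typically the harder of the two for a single-vector matcher.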
