ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Irene Huang; Wei Lin; M. Jehanzeb Mirza; Jacob A. Hansen; Sivan Doveh; Victor Ion Butoi; Roei Herzig; Assaf Arbelle; Hilde Kuehne; Trevor Darrell; Chuang Gan; Aude Oliva; Rogerio Feris; Leonid Karlinsky

TL;DR

The paper tackles the persistent challenge of compositional reasoning in modern vision-language models (VLMs) by critiquing existing CR benchmarks that rely on LLM-generated negatives. It proposes ConMe, a CR benchmark built via an automated pipeline that uses GPT-4V in conjunction with current open VLMs to orchestrate inter-VLM conversations, generating, evaluating, and selecting hard CR QA pairs with image context. The results show substantial CR performance drops for state-of-the-art VLMs on ConMe (up to ~33% relative to SugarCrepe) and demonstrate generalization to unseen models, supported by a manual verification subset. Additionally, the work introduces an automatic error-taxonomy analysis tool to mine weaknesses and provide actionable insights for improving CR capabilities in VLMs.

Abstract

Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce 'hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.
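
The generate, evaluate, and select loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `QA`, `generate_hard_qa`, `generator`, `describers`, `answerers`, `rounds`, and `min_mistake_rate` are hypothetical, the models are passed in as plain callables rather than real GPT-4V or VLM API calls, and the selection rule (keep a question when at least a given fraction of the evaluated VLMs pick the wrong answer) is an assumption made for illustration.

```python
from typing import Callable, Dict, List, NamedTuple


class QA(NamedTuple):
    """One CR question with its correct (positive) and confusing (negative) answer."""
    question: str
    positive: str
    negative: str


# Hypothetical callable types standing in for real model calls.
Describer = Callable[[str], str]                        # image -> description
Answerer = Callable[[str, QA], str]                     # (image, qa) -> chosen answer
Generator = Callable[[str, List[str], List[Dict]], List[QA]]


def generate_hard_qa(
    image: str,
    generator: Generator,
    describers: List[Describer],
    answerers: List[Answerer],
    rounds: int = 2,
    min_mistake_rate: float = 0.5,  # assumed selection threshold, not from the paper
) -> List[QA]:
    """Sketch of a multi-round 'conversation' that keeps questions the VLMs get wrong."""
    if not answerers:
        raise ValueError("at least one answering VLM is required")

    # Step 1: every model describes the image; descriptions become shared context.
    descriptions = [describe(image) for describe in describers]

    feedback: List[Dict] = []  # evaluation results from the previous round
    selected: List[QA] = []

    for _ in range(rounds):
        # Step 2: the generator proposes questions, conditioned on the
        # descriptions and on how the VLMs fared in the previous round.
        candidates = generator(image, descriptions, feedback)

        # Step 3: evaluate each candidate on all answering VLMs.
        round_feedback: List[Dict] = []
        for qa in candidates:
            answers = [answer(image, qa) for answer in answerers]
            wrong = sum(ans != qa.positive for ans in answers)
            rate = wrong / len(answers)
            round_feedback.append(
                {"qa": qa, "answers": answers, "mistake_rate": rate}
            )
            # Step 4 (assumed rule): keep questions that enough VLMs miss.
            if rate >= min_mistake_rate:
                selected.append(qa)
        feedback = round_feedback

    return selected
```

In this sketch the previous round's answers are handed back to the generator, which mirrors the idea of using the VLMs' own outputs as context for producing harder questions; how that context is actually formatted into prompts is left to the caller.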

Paper Structure (17 sections, 1 equation, 16 figures, 6 tables)

Introduction
Background and Related Work
ConMe: A Compositional Reasoning Benchmark
Hard Compositional Reasoning (CR) QA Generation Pipeline
Experiments
Datasets
Results
Analysis and Ablations
Comparison with Perplexity Inference Evaluation
Error Taxonomy Analysis
Conclusions and Limitations
Limitations
Prompts for CR QA Generation
Perplexity Prompts and Evaluations
Error Taxonomies Definition
...and 2 more sections

Figures (16)

Figure 1: We propose a new concept of VLMs conversing with each other to collaboratively expose their weaknesses. Our pipeline autonomously generates, evaluates, and selects challenging Compositional Reasoning (CR) questions, to establish a robust CR benchmark -- ConMe.
Figure 2: Our proposed CR data generation framework employs multiple open VLMs in a multi-stage 'conversation' setup. Given an image, GPT-4V and the VLMs are first prompted to describe the image in detail. Then, given all the generated descriptions from the VLMs and from GPT-4V itself as context, GPT-4V is tasked with generating the first iteration of CR questions; the VLMs are evaluated on these questions and also prompted to generate open-ended answers. Finally, GPT-4V is prompted again to generate more challenging CR questions, using the previous iteration's output as additional context. The result is a set of challenging CR questions with their correct answers (positives) and confusing wrong answers (negatives).
Figure 3: Distribution of mistake rates of various VLMs across different error categories automatically obtained by our proposed analysis framework. Table \ref{tab:error_categories_taxonomy} in the Appendix specifies each error category.
Figure 4: Distribution of mistake rates of various VLMs across different CR QA formats automatically obtained by our proposed analysis framework. Table \ref{tab:question_formats_taxonomy} in the Appendix specifies each CR QA format.
Figure 5: Step 1 Prompt
...and 11 more figures
