Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

Aishik Nagar; Shantanu Jaiswal; Cheston Tan

TL;DR

This work asks whether the zero-shot visual reasoning of vision-language models (VLMs) reflects genuine visual inference or world knowledge. It uses synthetic datasets (CLEVR and PTR) that require minimal world knowledge and support analysis across a broad range of reasoning steps. It compares two ways of conveying scene information, textual scene descriptions fed to LLMs versus visual embeddings fed to VLMs, and evaluates chain-of-thought (CoT) prompting against standard prompting. LLMs given textual scene descriptions consistently outperform VLMs given visual embeddings, with 18% higher accuracy on PTR. CoT prompting is marginally better than standard prompting only for the comparatively large GPT-3.5-Turbo model and is worse for smaller models, which suggests that CoT abilities for visual reasoning emerge at larger scales. Overall, the findings reveal limitations of current VLMs and LLMs on more complex visual reasoning and highlight the important role that LLMs can play in visual reasoning.

Abstract

Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering (VQA) benchmarks, suggesting that they can serve as visual reasoning engines. However, the benchmarks being used conflate "pure" visual reasoning with world knowledge, and their questions involve only a limited number of reasoning steps. Thus, it remains unclear whether a VLM's apparent visual reasoning performance is due to its world knowledge or to actual visual reasoning capabilities. To clarify this ambiguity, we systematically benchmark and dissect the zero-shot visual reasoning capabilities of VLMs through synthetic datasets that require minimal world knowledge and allow for analysis over a broad range of reasoning steps. We focus on two novel aspects of zero-shot visual reasoning: i) evaluating the impact of conveying scene information as either visual embeddings or purely textual scene descriptions to the underlying large language model (LLM) of the VLM, and ii) comparing the effectiveness of chain-of-thought (CoT) prompting to standard prompting for zero-shot visual reasoning. We find that the underlying LLMs, when provided textual scene descriptions, consistently perform better than when provided visual embeddings. In particular, 18% higher accuracy is achieved on the PTR dataset. We also find that CoT prompting performs marginally better than standard prompting only for the comparatively large GPT-3.5-Turbo (175B) model, and does worse for smaller-scale models. This suggests the emergence of CoT abilities for visual reasoning in LLMs at larger scales even when world knowledge is limited. Overall, we find limitations in the abilities of VLMs and LLMs for more complex visual reasoning, and highlight the important role that LLMs can play in visual reasoning.
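To make the two comparisons in the abstract concrete, the following is a minimal sketch of how scene metadata might be rendered as a textual scene description and wrapped in either a standard or a chain-of-thought prompt. It is not the authors' code: the metadata fields, the `describe_scene` and `build_prompt` helpers, and the prompt wording (including the "Let's think step by step." trigger) are assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical, simplified object record loosely modelled on CLEVR-style
# scene metadata; the field names here are illustrative assumptions.
SceneObject = Dict[str, object]


def describe_scene(objects: Sequence[SceneObject]) -> str:
    """Render each object's attributes as one plain-text line."""
    lines: List[str] = []
    for index, obj in enumerate(objects, start=1):
        x, y, z = obj["coords"]
        lines.append(
            f"Object {index}: a {obj['size']} {obj['color']} "
            f"{obj['material']} {obj['shape']} "
            f"at position ({x:.1f}, {y:.1f}, {z:.1f})."
        )
    return "\n".join(lines)


def build_prompt(scene_text: str, question: str, chain_of_thought: bool = False) -> str:
    """Build a zero-shot prompt; the suffix wording is an assumption."""
    suffix = "Let's think step by step." if chain_of_thought else "Answer:"
    return f"Scene:\n{scene_text}\n\nQuestion: {question}\n{suffix}"


if __name__ == "__main__":
    scene = [
        {"size": "large", "color": "red", "material": "metal",
         "shape": "cube", "coords": (1.0, 2.0, 0.5)},
        {"size": "small", "color": "blue", "material": "rubber",
         "shape": "sphere", "coords": (-1.5, 0.3, 0.35)},
    ]
    text = describe_scene(scene)
    print(build_prompt(text, "How many cubes are in the scene?"))
    print(build_prompt(text, "How many cubes are in the scene?", chain_of_thought=True))
```

In the VLM condition, the same question would be paired with visual embeddings of the rendered image in place of the `Scene:` text, so that the two conditions differ only in how scene information reaches the language model.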

Paper Structure (36 sections, 14 figures, 7 tables)

This paper contains 36 sections, 14 figures, 7 tables.

Introduction
Summary of Experiments and Findings
Contributions
Related Work
Experiments
Experimental Design
Experimental Setup
Results and Analyses
Comparing LLMs with scene descriptions versus VLMs
Chain-of-Thought (CoT) Reasoning
Conclusion
Acknowledgement
Appendix
Experiment Code and Reproducibility
Additional Figures and analysis
...and 21 more sections

Figures (14)

Figure 1: The experimental setup. We perform experiments on pure LLMs as well as their VLM variants with the same set of prompts. In case of LLMs, the image information is provided using the scene metadata used to render the image.
Figure 2: LLM versus VLM+Metadata versus VLM performance on CLEVR and PTR.
Figure 3: LLM versus VLM performance of Flan-T5-XXL on CLEVR and PTR, analyzed by length of functional programs (a proxy for number of reasoning steps). Error bars represent standard error; large error bars for functional programs longer than 18 are due to the small number of questions.
Figure 4: LLM versus VLM model performance of Flan-T5-XXL on CLEVR and PTR using standard prompting, organized by question family.
Figure 5: LLM versus VLM model performance of GPT-4 on CLEVR and PTR using standard prompting, organized by question family.
...and 9 more figures
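
The analysis described in the Figure 3 caption, accuracy grouped by functional-program length with standard-error bars, can be sketched as follows. This is an illustrative assumption rather than the authors' procedure: the `accuracy_by_program_length` helper and the input format are hypothetical, and the binomial standard error used here is one common choice that the source does not confirm.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_program_length(
    records: Iterable[Tuple[int, bool]],
) -> Dict[int, Tuple[float, float, int]]:
    """Group (program_length, is_correct) pairs by length.

    Returns {length: (accuracy, standard_error, question_count)}.
    The standard error is the binomial estimate sqrt(p * (1 - p) / n),
    which is an assumption about how the error bars are computed.
    """
    buckets = defaultdict(list)
    for length, is_correct in records:
        buckets[length].append(1.0 if is_correct else 0.0)

    stats: Dict[int, Tuple[float, float, int]] = {}
    for length in sorted(buckets):
        values = buckets[length]
        n = len(values)
        accuracy = sum(values) / n
        standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
        stats[length] = (accuracy, standard_error, n)
    return stats


if __name__ == "__main__":
    # Made-up records: many short-program questions, few long-program ones.
    records = [(5, i % 2 == 0) for i in range(100)] + [(20, True), (20, False)]
    for length, (acc, se, n) in accuracy_by_program_length(records).items():
        print(f"length={length:2d}  n={n:3d}  accuracy={acc:.2f}  se={se:.3f}")
```

With equal accuracy, a bucket holding only a couple of questions has a much larger standard error than one holding a hundred, which matches the caption's remark about large error bars for the longest programs.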
