The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

Yifan Wu; Pengchuan Zhang; Wenhan Xiong; Barlas Oguz; James C. Gee; Yixin Nie

TL;DR

The study addresses the gap in vision-language reasoning performance by testing a brain-inspired Chain-of-Thought prompting approach. It introduces a "Description then Decision" strategy that decomposes complex tasks into a perception step and a reasoning step, and evaluates it on Winoground using GPT-4V and other vision-language systems. Results show substantial gains, including a roughly 50% relative increase in the Group score (39.25 to 58.75) and a comparable improvement in the image score (46.25 to 68.75), with two-turn prompts delivering further boosts and reducing modality gaps. An error analysis highlights persistent difficulties in temporal, pragmatic, and abstract reasoning, pointing to future directions for reasoning paradigms in vision-language tasks.
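
The "50%" figure is a relative gain, not a gain in percentage points. As a quick check of the arithmetic, the short sketch below recomputes both quantities from the four scores quoted above; the helper name `relative_gain` is ours and does not come from the paper.

```python
def relative_gain(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Scores quoted in the TL;DR above (baseline, then with the proposed prompting).
group_before, group_after = 39.25, 58.75
image_before, image_after = 46.25, 68.75

group_gain = relative_gain(group_before, group_after)
image_gain = relative_gain(image_before, image_after)

# Report both the absolute difference (points) and the relative change (%).
print(f"Group score: +{group_after - group_before:.2f} points, {group_gain:.1f}% relative")
print(f"Image score: +{image_after - image_before:.2f} points, {image_gain:.1f}% relative")
```

The Group score rises by 19.5 points (about 49.7% relative) and the image score by 22.5 points (about 48.6% relative), which is where the rounded "50%" comes from.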

Abstract

The study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated perception and reasoning. We present the "Description then Decision" strategy, which is inspired by how humans process signals. This strategy significantly improves probing task performance by 50%, establishing the groundwork for future research on reasoning paradigms in complex vision-language tasks.
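
To make the two-step idea concrete, here is a minimal sketch of a "Description then Decision" flow in its two-turn form. It is an illustration under stated assumptions, not the authors' implementation: the `Model` callable type, the function name `describe_then_decide`, and both prompt strings are hypothetical, and any real vision-language API would need to be wrapped to fit this interface.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface (an assumption, not a real library API):
# takes a text prompt plus a list of image references and returns text.
Model = Callable[[str, Sequence[str]], str]


def describe_then_decide(
    describer: Model,
    decider: Model,
    images: Sequence[str],
    question: str,
) -> str:
    """Answer `question` about `images` in two steps: perception, then reasoning."""
    # Step 1 (perception): ask for a description of each image on its own.
    # The prompt wording is illustrative only.
    descriptions: List[str] = [
        describer("Describe this image in detail.", [image]) for image in images
    ]

    # Step 2 (reasoning): decide from the descriptions alone. No images are
    # passed here, so the decider may be a text-only model.
    context = "\n".join(
        f"Image {index + 1}: {text}" for index, text in enumerate(descriptions)
    )
    decision_prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return decider(decision_prompt, [])
```

Keeping the describer and the decider as separate callables mirrors the configurations named in the Figure 3 caption, where the description always comes from GPT-4V and the decision comes from either GPT-4 or GPT-4V.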

Paper Structure (7 sections, 4 equations, 6 figures, 3 tables)

Introduction
Probing Task
Evaluation
The Role of Chain-of-Thought
The Effect of Two-turns Prompt
Error Analysis
Conclusion

Figures (6)

Figure 1: Brain-inspired two-step reasoning. Panel (a) is adapted from Wikipedia under the CC BY-SA 3.0 license. Panel (b) is adapted from Winoground (Thrush et al., 2022).
Figure 2: Examples of results with different prompt configurations. Text in blue highlights differences in the prompts. All images shown here are from Winoground (Thrush et al., 2022).
Figure 3: Error analysis by tag category across different GPT-4V prompt configurations. Each bar represents the accuracy of a specific experiment configuration on a given tag category. From left to right, the experiments are: GPT-4V (1-turn), GPT-4V CoT (1-turn), GPT-4V Desp + GPT-4 QA (2-turns), GPT-4V Desp + GPT-4 CoT (2-turns), GPT-4V Desp + GPT-4V QA (2-turns), and GPT-4V Desp + GPT-4V CoT (2-turns).
Figure 4: Qualitative examples demonstrating the effect of Chain-of-Thought. For each example, the left side illustrates the text choice setting and the right side depicts the image choice setting. The top portion shows the outcome without Chain-of-Thought, referred to as "GPT-4V (1-turn)," while the bottom portion shows the results with Chain-of-Thought, labeled "GPT-4V CoT (1-turn)" in the benchmark table. All images shown here are from Winoground (Thrush et al., 2022).
Figure 5: Qualitative examples (continued) demonstrating the effect of Chain-of-Thought. For each example, the left side illustrates the text choice setting and the right side depicts the image choice setting. The top portion shows the outcome without Chain-of-Thought, referred to as "GPT-4V (1-turn)," while the bottom portion shows the results with Chain-of-Thought, labeled "GPT-4V CoT (1-turn)" in the benchmark table. All images shown here are from Winoground (Thrush et al., 2022).
...and 1 more figure
