Assessing GPT4-V on Structured Reasoning Tasks

Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Gust Verbruggen

TL;DR

This work evaluates GPT-4V on structured reasoning tasks spanning mathematics, visual data analysis, abstraction, and code generation. It introduces visual Chain-of-Thought (v-CoT) as a multimodal extension of CoT and demonstrates that v-CoT can significantly improve performance over vanilla GPT-4V on MathVista and ChartQA, with mixed results on ARC and modest gains on Spider. The study provides a detailed pattern-based analysis of successes and failures, highlighting common pitfalls such as arithmetic errors, color-labeling issues, and grid-perception challenges. Overall, the results underscore the potential of multimodal prompting strategies for complex reasoning while pointing to areas requiring further methodological and dataset improvements.

Abstract

Multi-modality promises to unlock further uses for large language models. Recently, the state-of-the-art language model GPT-4 was enhanced with vision capabilities. We carry out a prompting evaluation of GPT-4V and five other baselines on structured reasoning tasks, such as mathematical reasoning, visual data analysis, and code generation. We show that visual Chain-of-Thought, an extension of Chain-of-Thought to multi-modal LLMs, yields significant improvements over the vanilla model. We also present a categorized analysis of scenarios where these models perform well and where they struggle, highlighting challenges associated with coherent multimodal reasoning.

Paper Structure (23 sections, 9 figures, 2 tables)

This paper contains 23 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Without reasoning, the model does not find the correct answer. Both m-CoT and v-CoT elicit sufficient reasoning.
  • Figure 2: (1) An example task, (2) the output generated with m-CoT, and (3) the output generated with v-CoT. Red and green text highlight incorrect and correct reasoning, respectively.
  • Figure 3: Prompt structure for v-CoT (1) and m-CoT (2) prompts. Blue highlights the shared m-CoT instruction and green highlights our extension.
  • Figure 4: Sample tasks from each benchmark dataset. For each task we show the image, the associated text prompt, and the correct answer. For ChartQA and MathVista the answer is a choice or numeric value; for Spider the answer is the correct SQL query; for ARC the output is the correct pixel matrix ($0 \rightarrow \text{black}$; $1 \rightarrow \text{red}$; $2 \rightarrow \text{blue}$).
  • Figure 5: Manual analysis of GPT-4V + v-CoT on our sampled tasks. We manually annotate the reasoning and the final answer separately for each benchmark dataset. For ChartQA and Spider, the model generates the right explanation in 72--85% of cases and is able to derive the correct answer from it. For MathVista and ARC, which require more complex reasoning, the model struggles to reason correctly: only 51--58% of its reasoning is correct.
  • ...and 4 more figures

Theorems & Definitions (3)

  • Example 1
  • Example 2
  • Example 3