Assessing GPT4-V on Structured Reasoning Tasks
Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Gust Verbruggen
TL;DR
This work evaluates GPT-4V on structured reasoning tasks spanning mathematics, visual data analysis, abstraction, and code generation. It introduces visual Chain-of-Thought (v-CoT), a multimodal extension of Chain-of-Thought (CoT) prompting, and shows that v-CoT significantly improves performance over vanilla GPT-4V on MathVista and ChartQA, with mixed results on ARC and modest gains on Spider. The study provides a detailed pattern-based analysis of successes and failures, highlighting common pitfalls such as arithmetic errors, color-labeling issues, and grid-perception challenges. Overall, the results underscore the potential of multimodal prompting strategies for complex reasoning while pointing to areas that need further methodological and dataset improvements.
Abstract
Multimodality promises to unlock further uses for large language models. Recently, the state-of-the-art language model GPT-4 was enhanced with vision capabilities. We carry out a prompting evaluation of GPT-4V and five other baselines on structured reasoning tasks, such as mathematical reasoning, visual data analysis, and code generation. We show that visual Chain-of-Thought, an extension of Chain-of-Thought to multimodal LLMs, yields significant improvements over the vanilla model. We also present a categorized analysis of scenarios where these models perform well and where they struggle, highlighting challenges associated with coherent multimodal reasoning.
