Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning
Ayush Singh, Mansi Gupta, Shivank Garg, Abhinav Kumar, Vansh Agrawal
TL;DR
Vision-language models (VLMs) struggle with mathematical reasoning tasks such as counting and geometry. This paper compares captioning-based QA with task-based prompting and shows that, while captioning can help in some contexts, its benefits do not generalize across datasets, and larger models trained on QnA still fail on math challenges. Task-based prompting uses prompts derived solely from the question to guide reasoning. The paper also tests robustness with adversarial and random prompts and reports promising improvements over captioning. The findings suggest that structured prompting can better unlock mathematical reasoning in VLMs and can inform the design of robust multimodal reasoning systems.
Abstract
Vision-Language Models (VLMs) have transformed tasks requiring visual and reasoning abilities, such as image retrieval and Visual Question Answering (VQA). Despite their success, VLMs face significant challenges with tasks involving geometric reasoning, algebraic problem-solving, and counting. These limitations stem from difficulties in effectively integrating multiple modalities and in accurately interpreting geometry-related tasks. Various works claim that introducing a captioning pipeline before VQA tasks enhances performance. We incorporated this pipeline for tasks involving geometry, algebra, and counting. We found that the gains from captioning do not generalize: in particular, larger VLMs primarily trained on downstream QnA tasks show random performance on math-related challenges. As an alternative, we present task-based prompting, which enriches the prompt with task-specific guidance. This approach shows promise and proves more effective than direct captioning methods for math-heavy problems.
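To make the contrast between the prompting setups concrete, here is a minimal sketch of how the three prompt styles might be assembled: direct QA, caption-then-QA, and task-based prompting. It builds prompt strings only and calls no model. The function names, the prompt templates, and the hint wording in `TASK_HINTS` are all illustrative assumptions, not the prompts used in the paper, and the task label is passed in by hand rather than derived from the question.

```python
# Hypothetical task hints; the paper's actual prompt wording is not given here.
TASK_HINTS = {
    "counting": "Count each object one at a time before giving the total.",
    "geometry": "Identify the shapes and their labeled lengths and angles, then reason step by step.",
    "algebra": "Write down the equation implied by the figure, then solve it step by step.",
}


def direct_prompt(question: str) -> str:
    """Baseline: the question is passed to the VLM with no extra guidance."""
    return f"Question: {question}\nAnswer:"


def caption_then_qa_prompt(caption: str, question: str) -> str:
    """Captioning pipeline: a previously generated caption is prepended to the question."""
    return f"Image description: {caption}\nQuestion: {question}\nAnswer:"


def task_based_prompt(task: str, question: str) -> str:
    """Task-based prompting: task-specific guidance is prepended to the question."""
    hint = TASK_HINTS[task]  # raises KeyError for a task with no hint defined
    return f"{hint}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    q = "How many circles are in the image?"
    print(direct_prompt(q))
    print(caption_then_qa_prompt("A diagram with several overlapping circles.", q))
    print(task_based_prompt("counting", q))
```

In this sketch the only difference between the setups is what precedes the question: nothing, an image caption, or a task hint. That keeps the comparison focused on the added context rather than on other prompt differences.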
