Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework

Nitya Tiwari, Parv Maheshwari, Vidisha Agarwal

TL;DR

This work assesses how well Multimodal-CoT generalizes from science-focused questions to open-domain multimodal reasoning. It reuses Zhang et al.'s two-stage CoT framework and systematically adapts it to ChartQA, OK-VQA, and A-OKVQA, including data preprocessing, prompt engineering, and metric extensions. The study finds that while vision integration reduces rationale hallucination, performance varies by question type, with ChartQA presenting the biggest challenge due to numerical reasoning and structured visuals, and A-OKVQA showing the strongest cross-domain alignment among the three datasets. The results underscore substantial cross-domain gaps and offer practical guidelines and future directions for improving cross-domain multimodal reasoning under limited computing resources.

Abstract

While recent work has extended chain-of-thought (CoT) reasoning to multimodal settings, achieving state-of-the-art results on science question answering benchmarks like ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OK-VQA, and ChartQA datasets, which require broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. This work provides practical insights for researchers implementing multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.

Paper Structure

This paper contains 59 sections, 8 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: ChartQA Image
  • Figure 2: ChartQA Text
  • Figure 3: AOKVQA Image
  • Figure 4: AOKVQA Text
  • Figure 5: OKVQA Text
  • ...and 1 more figure