Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering

Riddhi Jain, Manasi Patwardhan, Parijat Deshpande, Venkataramana Runkana

TL;DR

This work tackles Visual Question Answering for Indian cuisine by auto-synthesizing reasoning chains that connect visual food items with culinary context. It extends the IndiFoodVQA dataset with structured chain-of-thought (CoT) reasoning chains and trains multimodal models via supervised fine-tuning and reinforcement learning (DPO/GRPO), achieving substantial gains over baselines. The approach yields state-of-the-art performance on the IndiFoodVQA benchmark (best model ~71.12% accuracy) and demonstrates that reasoning-driven training improves performance on multi-step, domain-specific questions, while domain knowledge augmentation provides selective benefits. The framework offers a scalable path to culturally aware VQA systems with interpretable reasoning, useful for nutritional analysis, culinary education, and cultural documentation.

Abstract

The immense cultural and culinary diversity of Indian cuisine exposes a major shortcoming of existing Visual Question Answering (VQA) systems, which are biased towards Western foods. A recent attempt to build a VQA dataset for Indian food is a step towards addressing this challenge. However, its approach to VQA follows a two-step process in which the answer is generated first, followed by an explanation of the expected answer. In this work, we claim that food VQA requires a multi-step reasoning process to arrive at an accurate answer, especially in the context of Indian food, which involves understanding complex culinary context and identifying relationships between various food items. Based on this hypothesis, we create reasoning chains for the QA pairs with minimal human intervention. We fine-tune smaller LLMs and VLMs on auto-validated reasoning chains and further train them using reinforcement learning on larger data. With the augmentation of reasoning chains, we observe an average accuracy improvement of 10 percentage points over the baseline. We provide a detailed analysis of the effect of adding reasoning chains for the Indian Food VQA task. Index Terms - FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.

Paper Structure

This paper contains 22 sections, 8 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Valid Reasoning Chain Generation Method
  • Figure 2: Images for Human Generated Reasoning Chains in Tables \ref{ex1} and \ref{ex2}