Modularized Zero-shot VQA with Pre-trained Models

Rui Cao; Jing Jiang

TL;DR

This work presents Mod-Zero-VQA, a modular zero-shot VQA framework that decomposes questions into explicit sub-steps and assigns each step to a suitable pre-trained model (OWL for object detection, MDETR for grounded descriptions, CLIP for answer generation). It adds spatial heuristics to mitigate CLIP’s weaknesses in spatial reasoning and maps traditional NMN layouts to a minimal, zero-shot module set, enabling interpretable reasoning without downstream VQA training. Empirical results on GQA show a relative improvement of nearly $13\%$ over the strongest zero-shot baselines, along with strong interpretability from the explicit reasoning chain and robust performance under out-of-domain conditions. The approach offers a flexible, extensible way to leverage improving PTMs for complex multimodal reasoning while exposing the reasoning process and its failure modes.

Abstract

Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by three observations. First, VQA questions often require multiple steps of reasoning, a capability that most PTMs still lack. Second, different steps in VQA reasoning chains require different skills, such as object detection and relational reasoning, but a single PTM may not possess all of these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes these methods less interpretable than a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable. We convert sub-reasoning tasks into objectives that PTMs accept and assign each task to a suitable PTM without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and its better interpretability compared with several baselines.
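
The decomposition idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the module names (`find`, `relate`, `query`, `exist`), the `Step` type, the `run_program` helper, and the center-comparison rule for "left of" and "right of" are all assumptions made for illustration. The `detect` and `describe` callables stand in for pre-trained models (for example, an open-vocabulary detector and an image-text matcher) and are replaced here by toy functions so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Step:
    module: str            # hypothetical module name: find, relate, query, exist
    args: Tuple[str, ...]  # text arguments passed to that module


def center_x(box: Box) -> float:
    """Horizontal center of a bounding box."""
    return (box[0] + box[2]) / 2.0


def spatial_filter(boxes: List[Box], anchors: List[Box], relation: str) -> List[Box]:
    """Assumed heuristic: keep boxes whose center is left/right of any anchor center."""
    if relation == "left of":
        return [b for b in boxes if any(center_x(b) < center_x(a) for a in anchors)]
    if relation == "right of":
        return [b for b in boxes if any(center_x(b) > center_x(a) for a in anchors)]
    raise ValueError(f"unsupported relation: {relation}")


def run_program(
    steps: List[Step],
    detect: Callable[[str], List[Box]],         # stand-in for a detector PTM
    describe: Callable[[List[Box], str], str],  # stand-in for an answering PTM
) -> Tuple[Optional[str], list]:
    """Run the steps in order and record every intermediate output in a trace."""
    current: List[Box] = []
    answer: Optional[str] = None
    trace = []
    for step in steps:
        if step.module == "find":
            current = detect(step.args[0])
            trace.append((step.module, list(current)))
        elif step.module == "relate":
            target, relation = step.args
            current = spatial_filter(detect(target), current, relation)
            trace.append((step.module, list(current)))
        elif step.module == "query":
            answer = describe(current, step.args[0])
            trace.append((step.module, answer))
        elif step.module == "exist":
            answer = "yes" if current else "no"
            trace.append((step.module, answer))
        else:
            raise ValueError(f"unknown module: {step.module}")
    return answer, trace


# Toy stand-ins for the pre-trained models (made-up boxes, no real inference).
toy_detections = {
    "laptop": [(50.0, 10.0, 90.0, 40.0)],
    "cup": [(5.0, 20.0, 15.0, 30.0), (95.0, 20.0, 99.0, 30.0)],
}
toy_detect = lambda name: toy_detections.get(name, [])
toy_describe = lambda boxes, attribute: "red" if boxes else "unknown"

# "What color is the cup to the left of the laptop?"
program = [
    Step("find", ("laptop",)),
    Step("relate", ("cup", "left of")),
    Step("query", ("color",)),
]
answer, trace = run_program(program, toy_detect, toy_describe)
```

The `trace` list is what makes such a pipeline inspectable: each entry pairs a step with its intermediate output, so a wrong answer can be traced back to the step that produced it.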

Paper Structure (34 sections, 4 equations, 8 figures, 11 tables)

Introduction
Background
Task Definition.
Existing Zero-shot VQA Methods.
Modularized Zero-shot VQA
Traditional VQA Modules
Pre-trained Models
OWL.
MDETR.
CLIP.
Zero-shot NMN using Pre-trained Models
Spatial Heuristics
Experiments
Dataset
Implementation Details
...and 19 more sections

Figures (8)

Figure 1: An overview of our proposed method. Instead of training modules in NMN, we propose a modularized zero-shot VQA method leveraging pre-trained models to perform different reasoning tasks.
Figure 2: Visualization of intermediate outputs from reasoning steps of the Mod-Zero-VQA model.
Figure 3: Visualization of existence-related questions where mentioned objects in the questions do not exist.
Figure 4: Visualization of questions asking about existence of attributes.
Figure 5: Visualization of questions asking about the existence of relations.
...and 3 more figures
