MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

Xianwei Mao; Kai Ye; Sheng Zhou; Nan Zhang; Haikuan Huang; Bin Li; Jiajun Bu

TL;DR

MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge.

Abstract

Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.
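To make the pruning idea in the abstract concrete, the following is a minimal sketch of threshold-based filtering over image patches and retrieved phrases. It is not the authors' implementation: the function name `mask_and_select`, the `FilteredEvidence` container, the fixed thresholds, and the assumption that relevance scores are already computed are all hypothetical choices made for illustration. The paper's outline indicates that the actual method uses a knowledge-guided attention mask with adaptive token reweighting, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FilteredEvidence:
    """Hypothetical container for the pruned multimodal evidence."""
    patch_mask: List[bool]  # True means the image patch is kept
    phrases: List[str]      # knowledge phrases that survived selection


def mask_and_select(patch_scores, phrases, phrase_scores,
                    patch_threshold=0.5, phrase_threshold=0.5):
    """Toy filter: keep patches and phrases whose score meets a threshold.

    Assumption: the relevance scores are precomputed by some upstream
    model; how they are produced is outside the scope of this sketch.
    """
    if len(phrases) != len(phrase_scores):
        raise ValueError("each phrase needs exactly one score")
    # Image side: a boolean mask over patches.
    patch_mask = [score >= patch_threshold for score in patch_scores]
    # Text side: drop weakly relevant phrases, preserving order.
    kept = [p for p, s in zip(phrases, phrase_scores) if s >= phrase_threshold]
    return FilteredEvidence(patch_mask=patch_mask, phrases=kept)


# Illustrative inputs only; the scores are made up for the example.
evidence = mask_and_select(
    patch_scores=[0.9, 0.1, 0.6, 0.3],
    phrases=["relevant fact A", "off-topic detail", "relevant fact B"],
    phrase_scores=[0.8, 0.2, 0.7],
)
```

In a pipeline of the kind the abstract describes, the surviving patches and phrases would then be passed to the MLLM as compact evidence for the implicit knowledge step.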

Paper Structure (31 sections, 18 equations, 7 figures, 4 tables)

Introduction
Related Work
Explicit Knowledge Methods
Implicit Knowledge Methods
Hybrid Explicit-Implicit Knowledge Methods
Method
Task Formulation
Explicit Knowledge Processing: Text and Image Signals
Image-side: knowledge-guided attention mask generation.
Adaptive token reweighting.
Token-wise thresholding and patch mask composition.
Text-side: question-conditioned phrase selection from retrieved knowledge.
Implicit Knowledge Processing
Inputs and format.
Role of implicit knowledge.
...and 16 more sections

Figures (7)

Figure 1: A comparison of vanilla KB-VQA and our proposed method. Compared to standard hybrid methods that separate explicit and implicit knowledge, MaS-VQA integrates their reasoning.
Figure 2: Overview of MaS-VQA. Given an image-question pair, MaS-VQA retrieves top-$k$ passages from an external knowledge base and performs Mask-and-Select explicit knowledge processing, including a knowledge-guided attention mask for filtering irrelevant image regions and question-conditioned phrase selection for pruning noisy text. The filtered multimodal evidence is then used for implicit knowledge processing to elicit complementary model-internal knowledge, and both knowledge sources are co-modeled for final answer prediction.
Figure 3: Qualitative case studies. Left: explicit external knowledge helps bridge missing factual gaps and corrects errors made without retrieval. Right: implicit knowledge complements retrieved evidence when the final decision requires commonsense/domain priors beyond the retrieved text.
Figure 4: Implicit Knowledge Processing complements explicit evidence. Even with filtered explicit knowledge, some questions require additional commonsense/domain priors. Our implicit knowledge processing elicits such parametric knowledge conditioned on the selected evidence, enabling correct reasoning and answers.
Figure 5: Implicit Knowledge Processing complements explicit evidence. Even with filtered explicit knowledge, some questions require additional commonsense/domain priors. Our implicit knowledge processing elicits such parametric knowledge conditioned on the selected evidence, enabling correct reasoning and answers.
...and 2 more figures
