MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang, Bin Li, Jiajun Bu

TL;DR

MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge.

Abstract

Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.
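
To make the Mask-and-Select idea concrete, here is a minimal toy sketch, not the authors' implementation. It assumes that text relevance can be approximated by bag-of-words cosine similarity between the question and each retrieved phrase, and that per-region relevance scores for the image are already available from some upstream model. The function names (`select_phrases`, `mask_regions`), the `keep` and `threshold` parameters, and the scoring rule are all hypothetical stand-ins for whatever learned components the paper uses.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a toy stand-in for a learned text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_phrases(question, phrases, keep=2):
    """Keep up to `keep` phrases most similar to the question, in original order.

    Hypothetical scoring: phrases with zero similarity are always dropped.
    """
    q = bag_of_words(question)
    scored = [(cosine(q, bag_of_words(p)), i) for i, p in enumerate(phrases)]
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    kept = sorted(i for score, i in top if score > 0)
    return [phrases[i] for i in kept]


def mask_regions(region_scores, threshold=0.5):
    """Binary mask over image regions: 1 keeps a region, 0 prunes it.

    `region_scores` is assumed to come from an upstream relevance model.
    """
    return [1 if s >= threshold else 0 for s in region_scores]
```

For example, given the question "When was the bridge in the image completed?" and the phrases "The bridge was completed in 1937.", "Tourists often buy souvenirs nearby.", and "The bridge spans the strait.", `select_phrases` with `keep=2` retains the first and third phrases and drops the second, which shares no tokens with the question. The surviving phrases and unmasked regions would then form the compact evidence passed on to the answering model.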

Paper Structure (31 sections, 18 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: A comparison of vanilla KB-VQA and our proposed method. Compared to standard hybrid methods that separate explicit and implicit knowledge, MaS-VQA integrates their reasoning.
  • Figure 2: Overview of MaS-VQA. Given an image-question pair, MaS-VQA retrieves top-$k$ passages from an external knowledge base and performs Mask-and-Select explicit knowledge processing, including a knowledge-guided attention mask for filtering irrelevant image regions and question-conditioned phrase selection for pruning noisy text. The filtered multimodal evidence is then used for implicit knowledge processing to elicit complementary model-internal knowledge, and both knowledge sources are co-modeled for final answer prediction.
  • Figure 3: Qualitative case studies. Left: explicit external knowledge helps bridge missing factual gaps and corrects errors made without retrieval. Right: implicit knowledge complements retrieved evidence when the final decision requires commonsense/domain priors beyond the retrieved text.
  • Figure 4: Implicit Knowledge Processing complements explicit evidence. Even with filtered explicit knowledge, some questions require additional commonsense/domain priors. Our implicit knowledge processing elicits such parametric knowledge conditioned on the selected evidence, enabling correct reasoning and answers.
  • Figure 5: Implicit Knowledge Processing complements explicit evidence. Even with filtered explicit knowledge, some questions require additional commonsense/domain priors. Our implicit knowledge processing elicits such parametric knowledge conditioned on the selected evidence, enabling correct reasoning and answers.
  • ...and 2 more figures