Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

Dongze Hao; Qunbo Wang; Longteng Guo; Jie Jiang; Jing Liu

TL;DR

The paper addresses knowledge-intensive visual question answering with a self-bootstrapped, two-module framework in which one copy of the LVLM acts as a knowledge Selector and a separate copy acts as the Answerer. Knowledge is first retrieved with Dense Passage Retrieval (DPR); the Selector then identifies the key documents for the Answerer to reason over. The two modules are trained in a cycle: the Answerer is fine-tuned on the Selector's choices, and the Selector is fine-tuned on pseudo-labels derived from the Answerer's predictions and weak supervision. The approach achieves state-of-the-art accuracy on OK-VQA (62.83%) while fine-tuning only 0.16% of the parameters via LoRA. The work shows that retrieval-augmented knowledge can be combined with LVLMs through a cooperative, self-improving architecture, giving more accurate knowledge-based multimodal reasoning with minimal trainable-parameter overhead.

Abstract

While large visual-language models (LVLMs) have shown promising results on traditional visual question answering (VQA) benchmarks, they still struggle with complex VQA problems that require diverse world knowledge. Motivated by research on retrieval-augmented generation in natural language processing, we use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions. However, DPR retrieves in natural language space, which may not capture the image information comprehensively. As a result, the retrieved knowledge may not truly help answer the question, which hurts the performance of the overall system. To address this issue, we propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and to answer questions. The framework consists of two modules, a Selector and an Answerer, both initialized from the LVLM and parameter-efficiently fine-tuned by self-bootstrapping: (1) find key knowledge in the retrieved knowledge documents using the Selector, and use it to fine-tune the Answerer to predict answers; (2) obtain pseudo-labels for the key knowledge documents from the Answerer's predictions and weak supervision labels, and use them to fine-tune the Selector to select key knowledge; (3) repeat. Our framework significantly improves on the baseline on the challenging open-domain knowledge-based VQA benchmark OK-VQA, achieving a state-of-the-art accuracy of 62.83%. Our code is publicly available at https://github.com/haodongze/Self-KSel-QAns.
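
The self-bootstrapping cycle described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample record layout, the four callables (`score_docs`, `predict`, `train_answerer`, `train_selector`), the "document contains a gold answer" form of weak supervision, and the rule that combines it with the Answerer's prediction are all hypothetical stand-ins for the LVLM-based modules.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, object]  # hypothetical record: {"question", "docs", "answers"}


def select_top_t(scores: Sequence[float], t: int) -> List[int]:
    """Return the indices of the t highest-scoring documents, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:t]


def pseudo_labels(docs: Sequence[str], gold_answers: Sequence[str],
                  predict: Callable[[str], str]) -> List[int]:
    """Mark a document as key (1) or not (0).

    Assumed rule: a document is key when it mentions a gold answer (weak
    supervision) AND the Answerer, given that document, predicts a gold answer.
    """
    labels = []
    for doc in docs:
        weak = any(ans.lower() in doc.lower() for ans in gold_answers)
        labels.append(1 if weak and predict(doc) in gold_answers else 0)
    return labels


def bootstrap(samples: List[Sample],
              score_docs: Callable[[Sample], Sequence[float]],
              predict: Callable[[Sample, str], str],
              train_answerer: Callable[[list], None],
              train_selector: Callable[[list], None],
              t: int, rounds: int) -> None:
    """Alternate Answerer and Selector updates for a fixed number of rounds."""
    for _ in range(rounds):
        # Step 1: the Selector picks top-t documents; the Answerer trains on them.
        answerer_data = []
        for s in samples:
            keep = select_top_t(score_docs(s), t)
            answerer_data.append((s, [s["docs"][i] for i in keep]))
        train_answerer(answerer_data)

        # Step 2: pseudo-label every retrieved document; the Selector trains on labels.
        selector_data = []
        for s in samples:
            labels = pseudo_labels(s["docs"], s["answers"],
                                   lambda d, s=s: predict(s, d))
            selector_data.append((s, labels))
        train_selector(selector_data)
```

In a real system `score_docs` and `predict` would call the Selector and Answerer models, and the two `train_*` callables would run LoRA fine-tuning steps; here they are plain callables so that the control flow of the cycle is visible on its own.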

Paper Structure (35 sections, 6 equations, 2 figures, 11 tables, 1 algorithm)

Introduction
Related work
Large Visual-Language Models.
Knowledge-based VQA.
Method
Preliminaries
Knowledge Retrieval.
Large Visual-Language Model.
Selector and Answerer
Selector.
Answerer.
Self-Bootstrap Learning
Answerer Training.
Selector Training.
Experiments
...and 20 more sections

Figures (2)

Figure 1: Our framework consists of two modules: a Selector and an Answerer. The Selector (left) selects the top-T knowledge documents for the Answerer (right), and the Answerer focuses on important knowledge information to predict answers. Both modules utilize the same frozen visual module to extract image features. We train the fully connected (FC) layer and fine-tune the language model using LoRA, which amounts to only 0.16% of the total parameters. For detailed training procedures of the two modules, refer to Algorithm 1. The original knowledge is retrieved using DPR, and for brevity, we omit the retrieval process here (details can be found in the Preliminaries section).
Figure 2: Qualitative results on the test split of OK-VQA. We compare our method with a model that fine-tunes BLIP2 on knowledge ranked by DPR. The middle of the figure shows the knowledge each method uses to answer the question. The right side shows the answers produced from that knowledge. Green indicates a correct final answer and red an incorrect one.
