SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge

Chuanhao Li; Zhen Li; Chenchen Jing; Shuo Liu; Wenqi Shao; Yuwei Wu; Ping Luo; Yu Qiao; Kaipeng Zhang

TL;DR

This work addresses the lack of up-to-date knowledge in LVLMs by introducing SearchLVLMs, a plug-and-play framework that retrieves current internet content during inference and filters it through a hierarchical content-selection pipeline. To train and evaluate the framework, the authors construct the UDK-VQA dataset using automatic news sampling and a multi-LVLM pseudo-score scheme. Empirical results across 15 LVLMs show substantial accuracy gains on up-to-date VQA, including outperforming GPT-4V, which has built-in IAG, by roughly 25% in accuracy on UDK-VQA, without any fine-tuning of the augmented models. The approach offers a practical, generalizable way to integrate fresh knowledge into LVLMs, with guidance on content filtering, diversity, and evaluation for real-world deployment.

Abstract

Large vision-language models (LVLMs), such as the LLaVA series, lack up-to-date knowledge because the large amount of resources required prevents them from being updated frequently, and they therefore fail in many cases. For example, an LVLM released in January 2024 would not know the singer of the theme song for the new Detective Conan movie, which was not released until April 2024. A promising solution, motivated by retrieval-augmented generation (RAG), is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG). IAG is already integrated into some closed-source commercial LVLMs such as GPT-4V, but the specific mechanics underpinning these systems remain a mystery. In this paper, we propose SearchLVLMs, a plug-and-play framework for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content among the websites returned by a search engine, and this content is used to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline that automatically generates news-related VQA samples to construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is introduced to label the usefulness of each website and content segment for VQA samples, which yields the training set. Experimental results demonstrate the effectiveness of our framework, which outperforms GPT-4V by about 25% in accuracy.
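The filter-then-prompt flow described in the abstract can be illustrated with a minimal sketch. It assumes a two-stage reading of "hierarchical": a coarse pass that ranks websites by title and snippet, followed by a finer pass that ranks content segments from the surviving websites. The names `Website`, `hierarchical_filter`, `build_prompt`, and `overlap_score` are hypothetical, and the word-overlap scorer is only a stand-in for the trained filtering model, which is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Website:
    """Hypothetical container for one search result."""
    title: str
    snippet: str
    segments: List[str]  # content segments parsed from the page


# (question, text) -> usefulness score; a stand-in for a trained scorer.
Scorer = Callable[[str, str], float]


def overlap_score(question: str, text: str) -> float:
    """Toy scorer (assumption): fraction of question words found in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / max(len(q_words), 1)


def hierarchical_filter(
    question: str,
    websites: List[Website],
    website_scorer: Scorer,
    content_scorer: Scorer,
    top_sites: int = 2,
    top_k: int = 2,
) -> List[str]:
    """Return the top_k segments drawn from the top_sites websites."""
    # Stage 1: cheap ranking of websites using only title and snippet.
    ranked_sites = sorted(
        websites,
        key=lambda w: website_scorer(question, f"{w.title} {w.snippet}"),
        reverse=True,
    )[:top_sites]
    # Stage 2: rank the content segments of the surviving websites.
    candidates = [seg for site in ranked_sites for seg in site.segments]
    ranked_segments = sorted(
        candidates, key=lambda s: content_scorer(question, s), reverse=True
    )
    return ranked_segments[:top_k]


def build_prompt(question: str, segments: List[str]) -> str:
    """Stitch the selected segments in front of the question."""
    context = "\n".join(segments)
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    sites = [
        Website("theme song singer announced", "movie theme song",
                ["the theme song was sung by artist X", "ticket prices rose"]),
        Website("weather report", "rain tomorrow", ["rain is expected"]),
    ]
    question = "who sang the theme song"
    chosen = hierarchical_filter(question, sites, overlap_score, overlap_score,
                                 top_sites=1, top_k=1)
    print(build_prompt(question, chosen))
```

Because the selected text is only prepended to the prompt, the LVLM itself is left unchanged in this sketch, which is consistent with the plug-and-play claim; the actual scoring models, thresholds, and prompt template used by the framework may differ.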

Paper Structure (26 sections, 6 figures, 5 tables)

Introduction
Related Work
Retrieval-Augmented Generation
Large Models with Search Engine
SearchLVLMs Framework
Query Generator
Search Engine
Hierarchical Filtering Model
Augmented Generation
UDK-VQA Dataset
Query Collection
Question Generation
Image Assignment
Pseudo-Score Generation
Manual Screening
...and 11 more sections

Figures (6)

Figure 1: The proposed SearchLVLMs, a framework for LVLMs to access up-to-date knowledge.
Figure 2: Overall pipeline of the sample generation for the UDK-VQA dataset. For brevity, we only show one output item at several steps, such as the content segment returned by the Parser. Notably, we use queries from different time periods to scrape news from different time periods to generate training samples and test samples, which is not reflected in this figure for brevity.
Figure 3: (a) Training samples. (b) Test samples. (c) Category statistics for the test set of UDK-VQA.
Figure 4: Accuracy using different LVLMs to generate pseudo-scores.
Figure 5: Comparison between Top-$K$ selection and diversity selection (Div-$K$), where $K$ denotes the number of stitched content segments for prompting LVLMs. For each sub-figure, the horizontal coordinate is $K$ and the vertical coordinate is the accuracy. Note that an accuracy of $0$ means that the model fails at the context length under the current setting of $K$, and is labeled as a triangle.
...and 1 more figure
