Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue

TL;DR

Vision Search Assistant is a novel framework in which VLMs and web agents collaborate: it combines VLMs' visual understanding with web agents' real-time information access to perform open-world Retrieval-Augmented Generation via the web.

Abstract

Search engines enable the retrieval of unknown information with texts. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs' visual understanding capabilities and web agents' real-time information access to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments conducted on both open-set and closed-set QA benchmarks demonstrate that the Vision Search Assistant significantly outperforms the other models and can be widely applied to existing VLMs.

Paper Structure

This paper contains 11 sections, 7 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Vision Search Assistant acquires unknown visual knowledge through web search. An intuitive comparison of answers to the user's question about an unseen image. The proposed Vision Search Assistant is built on LLaVA-1.6-7B, and its ability to answer questions about unseen images exceeds that of state-of-the-art models including LLaVA-1.6-34B [liu2023improved], Qwen2-VL-72B [Qwen-VL], and InternVL2-76B [chen2024far].
  • Figure 2: Comparison with closed-source models. Comparing Vision Search Assistant with GPT-4o [gpt4o], Gemini [reid2024gemini], and Claude 3.5 Sonnet [claude3] shows that Vision Search Assistant satisfies users' needs better even when the image is novel.
  • Figure 3: Overview of Vision Search Assistant. We first use the Vision Language Model (VLM) to identify the critical objects and generate their descriptions, taking their correlations into account; this step is named Correlated Formulation. We then use the LLM to generate sub-questions that lead to the final answer; this component is referred to as the Planning Agent. The web pages returned from the search engine are analyzed, selected, and summarized by the same LLM, which is referred to as the Searching Agent. We use the original image, the user's prompt, and the Correlated Formulation, together with the obtained web knowledge, to generate the final answer. Vision Search Assistant produces reliable answers, even for novel images, by leveraging the collaboration between the VLM and web agents to gather visual information from the web effectively.
  • Figure 4: The Chain of Search algorithm (§ \ref{sec:vsa:agent}). We derive the updates of the directed graph for $k = 1, 2, \dots$; web knowledge is progressively extracted from each update.
  • Figure 5: Open-Set Evaluation. We conducted a human expert evaluation on open-set QA tasks. Vision Search Assistant significantly outperformed Perplexity.ai Pro and GPT-4o-Web across three key objectives: factuality, relevance, and supportiveness.
  • ...and 5 more figures
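
The data flow described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`vision_search_answer`, `describe`, `plan`, `search`, `summarize`, `answer`, `max_rounds`) is hypothetical, and the VLM, LLM, and search engine are passed in as plain callables so that no real API is implied. The round loop only loosely mirrors the progressive extraction of web knowledge mentioned for Figure 4; it is not the paper's Chain of Search algorithm.

```python
from typing import Callable, List


def vision_search_answer(
    image: bytes,
    prompt: str,
    describe: Callable[[bytes, str], str],                 # VLM: Correlated Formulation (assumed interface)
    plan: Callable[[str, str, List[str]], List[str]],      # Planning Agent: sub-questions (assumed interface)
    search: Callable[[str], List[str]],                    # search engine: query -> page texts (assumed interface)
    summarize: Callable[[str, List[str]], str],            # Searching Agent: pages -> summary (assumed interface)
    answer: Callable[[bytes, str, str, List[str]], str],   # VLM: final answer (assumed interface)
    max_rounds: int = 2,                                   # hypothetical cap on search rounds
) -> str:
    """Hypothetical sketch of the pipeline outlined in the Figure 3 caption."""
    # Step 1: the VLM describes the critical objects and their correlations.
    formulation = describe(image, prompt)

    # Web knowledge gathered so far, one summary per answered sub-question.
    knowledge: List[str] = []

    for _ in range(max_rounds):
        # Step 2: the Planning Agent proposes sub-questions, given what is already known.
        sub_questions = plan(prompt, formulation, knowledge)
        if not sub_questions:
            break  # the planner has nothing further to ask

        # Step 3: the Searching Agent retrieves pages and summarizes them.
        for question in sub_questions:
            pages = search(question)
            if pages:
                knowledge.append(summarize(question, pages))

    # Step 4: image, prompt, formulation, and web knowledge produce the final answer.
    return answer(image, prompt, formulation, knowledge)
```

Injecting the models as callables keeps the sketch independent of any particular VLM or search backend, which is in the spirit of the abstract's claim that the framework can be applied to existing VLMs; a real system would replace each callable with an actual model or search call.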