An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models

Mengzhao Wang, Haotian Wu, Xiangyu Ke, Yunjun Gao, Xiaoliang Xu, Lu Chen

TL;DR

The paper addresses reliable, interactive multi-modal question answering by pairing retrieval-augmented LLMs with a dedicated multi-modal retrieval and indexing framework. It introduces the interactive Multi-modal Query Answering (MQA) system, built on the MUST retrieval framework (ICDE 2024) and the Starling navigation graph index, with contrastive-learning-based modality weighting and a pluggable pipeline for embedding models, graph indexes, and LLMs. The architecture has five components (Data Preprocessing, Vector Representation, Index Construction, Query Execution, and Answer Generation) orchestrated by a central coordinator, and is demonstrated through a frontend/backend implementation. The work aims to improve answer accuracy and user-driven adaptability for multi-modal knowledge bases through scalable indexing and efficient retrieval. The framework has practical potential for real-time, context-aware QA across text and visual inputs.
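
To make the coordinator idea concrete, here is a minimal sketch of a five-stage pipeline with swappable components. It is not the MQA implementation: the `Coordinator` class, its field names, and the toy components (`toy_embed`, `flat_search`, `echo_generate`) are all hypothetical stand-ins, with a keyword-count embedding in place of a real embedding model, a flat list in place of a navigation graph index, and a string template in place of an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]
VOCAB = ["cat", "dog", "red", "blue"]  # toy vocabulary (assumption)


def toy_embed(text: str) -> Vector:
    """Stand-in embedding model: counts of a few keywords."""
    return [float(text.count(word)) for word in VOCAB]


def flat_search(index: Sequence[Vector], query: Vector, k: int) -> List[int]:
    """Stand-in for a graph index: brute-force inner-product ranking."""
    def score(i: int) -> float:
        return sum(a * b for a, b in zip(index[i], query))
    return sorted(range(len(index)), key=score, reverse=True)[:k]


def echo_generate(question: str, context: List[str]) -> str:
    """Stand-in for an LLM: returns the prompt it would have received."""
    return f"Q: {question} | context: {'; '.join(context)}"


@dataclass
class Coordinator:
    """Hypothetical coordinator wiring the five stages together."""
    preprocess: Callable[[str], str]                  # Data Preprocessing
    embed: Callable[[str], Vector]                    # Vector Representation
    build_index: Callable[[Sequence[Vector]], list]   # Index Construction
    search: Callable[[list, Vector, int], List[int]]  # Query Execution
    generate: Callable[[str, List[str]], str]         # Answer Generation

    def answer(self, documents: List[str], question: str, k: int = 1) -> str:
        cleaned = [self.preprocess(d) for d in documents]
        index = self.build_index([self.embed(d) for d in cleaned])
        query_vec = self.embed(self.preprocess(question))
        hits = self.search(index, query_vec, k)
        return self.generate(question, [cleaned[i] for i in hits])


mqa = Coordinator(
    preprocess=lambda s: s.strip().lower(),
    embed=toy_embed,
    build_index=list,
    search=flat_search,
    generate=echo_generate,
)
print(mqa.answer(["A red cat ", "A blue dog"], "Which one is the dog?"))
```

Because each stage is passed in as a plain callable, swapping the embedding model, index, or generator means changing one constructor argument, which is the kind of flexibility the pluggable design is described as offering.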

Abstract

Retrieval-augmented Large Language Models (LLMs) have reshaped traditional query-answering systems, offering unparalleled user experiences. However, existing retrieval techniques often struggle to handle multi-modal query contexts. In this paper, we present an interactive Multi-modal Query Answering (MQA) system, empowered by our newly developed multi-modal retrieval framework and navigation graph index, integrated with cutting-edge LLMs. It comprises five core components: Data Preprocessing, Vector Representation, Index Construction, Query Execution, and Answer Generation, all orchestrated by a dedicated coordinator to ensure smooth data flow from input to answer generation. One notable aspect of MQA is its utilization of contrastive learning to assess the significance of different modalities, facilitating precise measurement of multi-modal information similarity. Furthermore, the system achieves efficient retrieval through our advanced navigation graph index, refined using computational pruning techniques. Another highlight of our system is its pluggable processing framework, allowing seamless integration of embedding models, graph indexes, and LLMs. This flexibility provides users diverse options for gaining insights from their multi-modal knowledge base. A preliminary video introduction of MQA is available at https://youtu.be/xvUuo2ZIqWk.
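
The role of modality weights in similarity measurement can be illustrated with a small sketch. The abstract does not state how the learned weights enter the similarity, so this example assumes one plausible form: a weighted sum of per-modality cosine similarities. The weights are hard-coded here rather than learned by contrastive training, and the function names (`joint_similarity`, `top_k`) and toy vectors are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
MultiModal = Dict[str, Vector]  # modality name -> embedding


def normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def joint_similarity(query: MultiModal, obj: MultiModal,
                     weights: Dict[str, float]) -> float:
    """Assumed form: weighted sum of per-modality cosine similarities."""
    total = 0.0
    for modality, weight in weights.items():
        q = normalize(query[modality])
        o = normalize(obj[modality])
        total += weight * sum(a * b for a, b in zip(q, o))
    return total


def top_k(query: MultiModal, objects: Dict[str, MultiModal],
          weights: Dict[str, float], k: int = 1) -> List[str]:
    """Brute-force ranking; a graph index would replace this scan."""
    ranked = sorted(objects,
                    key=lambda name: joint_similarity(query, objects[name], weights),
                    reverse=True)
    return ranked[:k]


objects = {
    "a": {"text": [1.0, 0.0], "image": [0.0, 1.0]},  # matches query text
    "b": {"text": [0.0, 1.0], "image": [1.0, 0.0]},  # matches query image
}
query = {"text": [1.0, 0.0], "image": [1.0, 0.0]}

print(top_k(query, objects, {"text": 0.8, "image": 0.2}))  # text-heavy weights
print(top_k(query, objects, {"text": 0.2, "image": 0.8}))  # image-heavy weights
```

With text-heavy weights the object whose text matches the query ranks first, and with image-heavy weights the ranking flips, which shows why the relative importance of each modality matters for retrieval quality.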

Paper Structure (3 sections, 5 figures)

This paper contains 3 sections and 5 figures.

Figures (5)

  • Figure 1: An example of multi-modal QA.
  • Figure 2: The overall architecture of MQA.
  • Figure 3: User interface of MQA.
  • Figure 4: Interaction examples.
  • Figure 5: Results from two-round response using different retrieval frameworks, with the user's choice marked in red.