MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering

Zhengyuan Zhu; Daniel Lee; Hong Zhang; Sai Sree Harsha; Loic Feujio; Akash Maharaj; Yunyao Li

TL;DR

MuRAR tackles the challenge of generating coherent multimodal answers by integrating retrieval-augmented generation with section-level multimodal data retrieval. The framework comprises three stages (text answer generation, source-based multimodal retrieval, and multimodal answer refinement) and produces a multimodal answer $A_{mm}$ that fuses text with relevant images, tables, and videos. Evaluations on enterprise-style data show that multimodal answers are more useful and readable than text-only responses and are preferred over them, with GPT-4 delivering superior multimodal content quality. The work demonstrates practical potential for enterprise AI assistants and outlines future directions: broadening modalities, optimizing recall, and strengthening UX and safety considerations.

Abstract

Recent advancements in retrieval-augmented generation (RAG) have demonstrated impressive performance in the question-answering (QA) task. However, most previous works predominantly focus on text-based answers. While some studies address multimodal data, they still fall short in generating comprehensive multimodal answers, particularly for explaining concepts or providing step-by-step tutorials on how to accomplish specific goals. This capability is especially valuable for applications such as enterprise chatbots, customer service, and educational systems, where the answers are sourced from multimodal data. In this paper, we introduce a simple and effective framework named MuRAR (Multimodal Retrieval and Answer Refinement). MuRAR enhances text-based answers by retrieving relevant multimodal data and refining the responses to create coherent multimodal answers. This framework can be easily extended to support multimodal answers in enterprise chatbots with minimal modifications. Human evaluation results indicate that multimodal answers generated by MuRAR are more useful and readable than plain text answers.
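
The three stages named above (text answer generation, source-based multimodal retrieval, and multimodal answer refinement) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the `Section` type, the function names, and the word-overlap heuristics are hypothetical stand-ins for the LLM calls and retrieval components a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical document section: its text plus the media it contains."""
    text: str
    media: list[str] = field(default_factory=list)  # e.g. image/table/video paths


def _words(text: str) -> set[str]:
    """Lowercased whitespace tokens; a crude stand-in for real text matching."""
    return set(text.lower().split())


def generate_text_answer(question: str, sections: list[Section]) -> str:
    """Stage 1 stand-in: a real system would prompt an LLM with retrieved text.

    Assumption: here we just return the section sharing the most words
    with the question.
    """
    best = max(sections, key=lambda s: len(_words(question) & _words(s.text)))
    return best.text


def attribute_sources(answer: str, sections: list[Section]) -> list[Section]:
    """Stage 2a stand-in: keep sections that share at least 3 words with the answer."""
    return [s for s in sections if len(_words(answer) & _words(s.text)) >= 3]


def retrieve_media(sources: list[Section]) -> list[str]:
    """Stage 2b: collect the media attached to the attributed sections."""
    return [m for s in sources for m in s.media]


def refine_answer(answer: str, media: list[str]) -> str:
    """Stage 3 stand-in: a real system would ask an LLM to place each item
    where it fits in the text; here we simply append placeholders."""
    if not media:
        return answer
    return answer + "\n" + "\n".join(f"[media: {m}]" for m in media)


def murar_sketch(question: str, sections: list[Section]) -> str:
    """Chain the three stages into one multimodal answer string."""
    text_answer = generate_text_answer(question, sections)
    sources = attribute_sources(text_answer, sections)
    return refine_answer(text_answer, retrieve_media(sources))
```

Calling `murar_sketch` with a question and a list of `Section` objects returns the text answer followed by placeholders for media drawn only from the sections that the answer was attributed to. The point of the sketch is the data flow: media is selected through the answer's sources rather than by matching the question directly.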

Paper Structure (33 sections, 5 figures, 5 tables)

Introduction
Design of MuRAR
Text Answer Generation
Source-Based Multimodal Retrieval
Source Attribution.
Section-Level Multimodal Data Retrieval.
Multimodal Answer Refinement
User Interface
Data Collection
Multimodal Document Dataset
Multimodal Question Answering Dataset
Human Evaluation
Study Setup
Evaluation Schema
Results
...and 18 more sections

Figures (5)

Figure 1: The architecture of the MuRAR framework.
Figure 2: The interface of the AI Assistant, showing multimodal answers constructed by combining multimodal data retrieval and answer refinement.
Figure 3: Distribution of score counts for GPT-3.5, GPT-4, and the combined results of GPT-3.5 and GPT-4.
Figure 4: Prompt for text answer generation.
Figure 5: Prompt for multimodal answer refinement.
