MuDoC: An Interactive Multimodal Document-grounded Conversational AI System

Karan Taneja, Ashok K. Goel

TL;DR

MuDoC tackles the challenge of long multimodal document understanding by enabling document-grounded dialogue with interleaved text and visuals. It introduces a retrieval-based system built on GPT-4o that preprocesses PDFs into text and image embeddings, and generates grounded responses using both text and image content. A textbook-like UI enables verification via seamless navigation to source text and figures. Preliminary qualitative results show promise in leveraging visuals for explanations, but also reveal issues with figure placement and potential hallucinations when multiple images are involved, guiding future quantitative evaluation and coherence improvements.

Abstract

Multimodal AI is an important step towards building effective tools to leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside textual content in documents for response generation. We present an interactive conversational AI agent 'MuDoC' based on GPT-4o to generate document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses highlighting its strengths and limitations.

Paper Structure

This paper contains 8 sections, 3 figures.

Figures (3)

  • Figure 1: Document Preprocessing: PDF document layouts are detected to extract text and image snippets which are processed using OCR, GPT-3/4 and embedding models to create text and image embeddings for retrieval during response generation.
  • Figure 2: Response Generation: GPT-4o's tool-calling feature is used to create text and image retrieval queries. Outputs from embedding-based retrieval are used for response generation, and image references in the response are replaced with the actual images.
  • Figure 3: UI Features: The Chat-PDF display in (a) shows the chat area on the left, the text box at the bottom left, and the PDF on the right. Yellow boxes and arrows describe features including summarize, Explain-it-Like-I'm-10 (ELI10), and PDF navigation using text and images.
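
The retrieval and image-substitution steps named in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not MuDoC's implementation: the cosine-similarity ranking, the `[IMAGE:id]` placeholder syntax, the Markdown image output, and the names `retrieve` and `insert_images` are all assumptions made for illustration, and the toy vectors stand in for real text and image embeddings.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, k=2):
    """Return the ids of the k entries most similar to the query.

    `index` is a list of (item_id, embedding) pairs; this layout is
    hypothetical and chosen only to keep the sketch short.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


# Assumed placeholder format for image references in the generated text.
IMAGE_REF = re.compile(r"\[IMAGE:(\w+)\]")


def insert_images(response, image_paths):
    """Replace each placeholder with a Markdown image link.

    References to ids that were never retrieved are dropped rather than
    rendered, so an unknown id cannot surface as a broken figure.
    """
    def swap(match):
        image_id = match.group(1)
        path = image_paths.get(image_id)
        return f"![{image_id}]({path})" if path else ""

    return IMAGE_REF.sub(swap, response)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real image embeddings.
    index = [("fig1", [1.0, 0.0]), ("fig2", [0.0, 1.0]), ("fig3", [0.7, 0.7])]
    print(retrieve([1.0, 0.1], index, k=2))
    print(insert_images("See [IMAGE:fig1] here.", {"fig1": "figs/fig1.png"}))
```

Dropping unresolved references is one possible design choice; a real system might instead log them or ask the model to regenerate the response.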