Towards a Multimodal Document-grounded Conversational AI System for Education

Karan Taneja, Anjali Singh, Ashok K. Goel

TL;DR

This paper presents MuDoC, a multimodal, document-grounded conversational AI for education built on GPT-4o that grounds responses in both text and figures from course documents. It combines a layout-aware document preprocessing pipeline, retrieval-augmented response generation, and an interactive UI that enables source navigation to foster verification and trust. In a within-subject study against a text-only baseline, MuDoC increases learner engagement and perceived trust via visuals and verifiability, though it does not produce a significant improvement in problem-solving performance. The authors discuss implications grounded in multimedia learning and cognitive load theory, highlighting benefits for memory retention and critical thinking while noting cognitive load and response length concerns, and propose future work on personalization, concise responses, and better source attribution to enhance educational impact.

Abstract

Multimedia learning using text and images has been shown to improve learning outcomes compared to text-only instruction. However, conversational AI systems in education predominantly rely on text-based interactions, while multimodal conversations for multimedia learning remain unexplored. Moreover, deploying conversational AI in learning contexts requires grounding in reliable sources and verifiability to create trust. We present MuDoC, a Multimodal Document-grounded Conversational AI system based on GPT-4o that leverages both text and visuals from documents to generate responses interleaved with text and images. Its interface allows verification of AI-generated content through seamless navigation to the source. We compare MuDoC to a text-only system to explore differences in learner engagement, trust in the AI system, and learners' performance on problem-solving tasks. Our findings indicate that both visuals and verifiability of content enhance learner engagement and foster trust; however, no significant impact on performance was observed. We draw upon theories from cognitive and learning sciences to interpret the findings and derive implications, and outline future directions for the development of multimodal conversational AI systems in education.

Paper Structure

This paper contains 14 sections and 4 figures.

Figures (4)

  • Figure 1: MuDoC's Document Pre-processing and Response Generation Pipeline. Details and demo available in supplementary material.
  • Figure 2: User Interface Design and Features
  • Figure 3: Participants' preferences comparing the text-only TexDoC (1) and multimodal MuDoC (2) systems for learning experience. Numbers indicate participant count.
  • Figure 4: Perceived 'usefulness' of different MuDoC features sorted by aggregated preferences. Numbers indicate participant count.