Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Minghan Wang; Yuxia Wang; Thuy-Trang Vu; Ehsan Shareghi; Gholamreza Haffari

TL;DR

The Scientific Vision Augmented ASR (SciVASR) framework is proposed as a baseline method, enabling MLLMs to improve transcript quality through post-editing, and shows a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.

Abstract

Recent multimodal large language models (MLLMs) have made significant progress in integrating information across modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to improve the accuracy of technical terminology. Because traditional metrics like WER fall short in assessing performance accurately, we propose severity-aware WER (SWER), which considers the content type and severity of ASR errors. We also propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.
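To make the motivation for a severity-aware metric concrete, here is a minimal Python sketch of standard WER (word-level edit distance divided by reference length) next to a weighted variant. The weighted variant is only an illustration of the general idea: the function `weighted_wer`, its `weights` argument, and the choice to charge errors by the weight of the affected reference word are assumptions made here, not the SWER definition from the paper, which this page does not give.

```python
from typing import Dict, Optional


def weighted_wer(reference: str, hypothesis: str,
                 weights: Optional[Dict[str, float]] = None,
                 default: float = 1.0) -> float:
    """Edit-distance error rate where each reference word has a weight.

    Hypothetical illustration, NOT the paper's SWER: deleting or
    substituting a reference word costs that word's weight, inserting an
    extra hypothesis word costs 1.0, and the total is normalized by the
    summed weights of the reference words.
    """
    weights = weights or {}
    ref = reference.split()
    hyp = hypothesis.split()
    ref_w = [weights.get(tok, default) for tok in ref]
    total = sum(ref_w)
    if total <= 0:
        raise ValueError("reference must carry positive total weight")

    # prev[j]: minimum cost of aligning the reference words seen so far
    # with the first j hypothesis words (row 0: j insertions).
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        w = ref_w[i - 1]
        curr = [prev[0] + w] + [0.0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0.0 if r == h else w)
            deletion = prev[j] + w
            insertion = curr[j - 1] + 1.0
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / total


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: every word has weight 1."""
    return weighted_wer(reference, hypothesis)


if __name__ == "__main__":
    ref = "we fine tune bert"
    hyp = "we fine tune bird"
    print(word_error_rate(ref, hyp))
    # Assumed weighting: treat the technical term as three times as costly.
    print(weighted_wer(ref, hyp, weights={"bert": 3.0}))
```

With uniform weights, misrecognizing a technical term such as "bert" counts the same as any other single-word error. With the assumed weight on the term, the same error takes up a larger share of the score, which is the kind of distinction a severity-aware metric is meant to capture.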

Paper Structure (51 sections, 1 equation, 7 figures, 7 tables)

Introduction
Multimodal Scientific ASR
Task Formulation
ACL60/60 Dataset
Scientific Vision-Augmented ASR
Speech Recognition
Video Segmentation
Slide Analysis
Visual Context Extraction
Visual Context Condensement
Post-editing
End-to-end Vision Post-editing
Experiments
Models
ASR Models
...and 36 more sections

Figures (7)

Figure 1: This figure illustrates the importance of visual information in accurately recognizing the term BERT. It also introduces our proposed evaluation metric, severity-aware WER (SWER), which calibrates WER with the severity of ASR errors.
Figure 2: The architecture of SciVASR. The baseline ASR model transcribes a presentation recording into a transcript. The video segmenter splits video frames into a sequence of scenes, each presenting one slide. The slide analyzer applies multimodal LLMs to extract key information from scene images into visual contexts represented as text. The post-editor instructs a text-LLM to leverage the visual contexts to refine the ASR transcript.
Figure 3: The win, tie, and loss ratios between different settings according to human annotations, based on Whisper-base and SciVASR in the open-source setup.
Figure 4: An example showing that a vision-LLM can hallucinate when the question contains a false premise. The image is sampled from video 410, and CogAgent-VQA is used for inference.
Figure 5: This figure shows the prompt for post-editing with text-LLM.
...and 2 more figures
