BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion

Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti, Enes Yavuz Ugan, Alexander Waibel, Jan Niehues

TL;DR

BOOM addresses the localization of multimodal lectures by translating audio and slides into multiple languages while preserving visual layout and providing synchronized outputs. It extends OmniFusion-based multimodal speech translation (ST) with a slide-translation pipeline (OCR, layout analysis, inpainting, drawing) that renders translated slides. Experiments on the VISTRA and MCIF benchmarks show that incorporating slide visuals improves ST quality and benefits downstream tasks such as summarization and question answering (QA). The work releases an open-source pipeline and highlights its potential for accessible, multilingual education.

Abstract

The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.

Paper Structure

This paper contains 23 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison of the English (original) and German (translated) slides. Text outside the images is translated with a unimodal system for efficiency, while text inside the images is translated using a multimodal system.
  • Figure 2: Overview of the image translator pipeline. Arrows indicate the inputs to each step. All steps are model-based except for drawing, which uses heuristic rules.
  • Figure 3: Example illustrating that our Image Translator uses context for disambiguation. The word "Exit" can mean "Ausgang" in the context of a pedestrian exit and "Ausfahrt" in the context of a car exit. Our translator correctly leverages the visual context to produce different translations, even when the source text is identical in both scenarios.
  • Figure 4: Translations of the YouTube video “Richard Feynman: Can Machines Think?” (https://www.youtube.com/watch?v=ipRvjS7q1DI). Subfigure (a) shows the English version; subfigure (b) shows the German version.
  • Figure 5: Summarization and Question Answering user interface. The summaries are shown for each chapter in all languages.
  • ...and 3 more figures