GlobalizeEd: A Multimodal Translation System that Preserves Speaker Identity in Academic Lectures

Hoang-Son Vo, Karina Kolmogortseva, Ngumimi Karen Iyortsuun, Hong-Duyen Vo, Soo-Hyung Kim

TL;DR

GlobalizeEd tackles the challenge of translating academic lectures while preserving speaker identity and enabling tone-appropriate, culture-aware delivery. The authors implement a multimodal pipeline that combines segment-wise, LLM-driven translation with accent-preserving voice cloning and diffusion-based lip synchronization, augmented by duration alignment and a user-centric interface. In a mixed-methods study with 18 instructors and 18 students, the system reduces cognitive load and enhances engagement compared to traditional subtitles, while maintaining learning effectiveness similar to that of high-quality subtitles. The work provides a practical, user-centered AI framework for cross-cultural education, demonstrating how linguistic fidelity, cultural adaptability, and user control can yield more inclusive global learning experiences.

Abstract

A large amount of valuable academic content is available only in its original language, creating a significant access barrier for the global student community. The barrier is especially acute in subjects such as history, culture, and the arts, where current automated subtitle tools fail to convey the appropriate pedagogical tone and specialized meaning. In addition, reading traditional automated subtitles increases cognitive load and leads to a disconnected learning experience. Through a mixed-methods study involving 36 participants, we found that GlobalizeEd's dubbed formats significantly reduce cognitive load and offer a more immersive learning experience compared to traditional subtitles. Although learning effectiveness was comparable between high-quality subtitles and dubbed formats, both groups valued GlobalizeEd's ability to preserve the speaker's voice, which enhanced perceived authenticity. Instructors rated translation accuracy and vocal naturalness, whereas students reported that synchronized, identity-preserving outputs fostered engagement and trust. This work contributes a novel human-centered AI framework for cross-lingual education, demonstrating how multimodal translation systems can balance linguistic fidelity, cultural adaptability, and user control to create more inclusive global learning experiences.

Paper Structure

This paper contains 41 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The AI system architecture of the GlobalizeEd multimodal translation pipeline. The input video is analyzed to separate its components: the speaker's voice, the background audio, and the video stream. The voice is transcribed via Speech-to-Text, and an LLM translates the resulting transcript. The translated transcript is then passed to Text-to-Speech, which mimics the original voice and accent to produce new audio. This new audio and the original video stream are then fed into a lip-sync module to generate synchronized mouth movements. Finally, the background audio, the translated voice, and the lip-synced video are combined to produce the final dubbed video output.
  • Figure 2: The user interface of the GlobalizeEd prototype. The interface is composed of three main panels: (a) a video preview window, (b) an AI Tools panel that guides the user through the step-by-step translation and lip-sync process, and (c) a timeline editor. The timeline view visualizes the separated multimodal tracks (video, vocal, background audio, transcript, and translated transcript), providing fine-grained control and visibility over the process.
  • Figure 3: The technical architecture of GlobalizeEd. The user interacts with the Frontend Layer, which communicates with the Backend API Layer through REST APIs. The backend acts as an orchestrator, calling internal AI services as needed. All components interact with a central Database/Storage layer for data persistence and asset management.
  • Figure 4: Feature Importance Rankings. The violin plots show the distribution of importance scores for five system features, as ranked by the instructors (1=Least Important, 5=Most Important).
  • Figure 5: Teacher Assessment Score Distribution. The violin plots show the distribution of instructor ratings (1-5) for the system across six metrics. The dashed line represents the 4.0 agreement threshold.
  • ...and 2 more figures
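
The stage order described in the Figure 1 caption can be sketched as a plain function composition. This is only an illustrative outline, not the authors' implementation: every name below (`Separated`, `dub_lecture`, and the six stage callables) is hypothetical, and strings stand in for real audio and video assets. Duration alignment, which the TL;DR mentions, is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Separated:
    """Hypothetical container for the three tracks split from the input video."""
    voice: str       # isolated speaker voice
    background: str  # background audio (music, ambience)
    video: str       # video stream without audio


def dub_lecture(
    source: str,
    separate: Callable[[str], Separated],
    transcribe: Callable[[str], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str, str], str],
    lip_sync: Callable[[str, str], str],
    mix: Callable[[str, str, str], str],
) -> str:
    """Run the stages in the order given in the Figure 1 caption.

    Each stage is injected as a callable, so real models (assumed, not
    specified here) can be swapped in without changing the control flow.
    """
    parts = separate(source)                 # split voice / background / video
    transcript = transcribe(parts.voice)     # Speech-to-Text on the voice track
    translated = translate(transcript)       # LLM translation of the transcript
    # Text-to-Speech receives the original voice as a reference for cloning.
    new_voice = synthesize(translated, parts.voice)
    synced_video = lip_sync(parts.video, new_voice)  # regenerate mouth movements
    # Recombine background audio, translated voice, and lip-synced video.
    return mix(parts.background, new_voice, synced_video)
```

Passing the stages in as arguments mirrors the orchestrator role that the Figure 3 caption assigns to the backend: the control flow stays fixed while each AI service can be replaced or tested in isolation.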