Project MOSLA: Recording Every Moment of Second Language Acquisition
Masato Hagiwara, Joshua Tanner
TL;DR
Project MOSLA addresses the lack of longitudinal, multimodal data in second language acquisition (SLA) by recording two years of online, instructor-led lessons in Arabic, Spanish, and Chinese, totaling over 250 hours of Zoom recordings. The dataset is semi-automatically annotated through a pipeline that combines human transcription with machine diarization, language/speaker identification, and automatic speech recognition (ASR) based on fine-tuned Whisper models, enabling analyses of language use, lexical development, and multimodal attention. Experiments show that fine-tuning improves speaker/language identification and ASR, and reveal linguistic trends such as increasing target-language usage over time, as well as the potential to infer screen focus from unannotated audio-video data via the Matchmap approach. MOSLA is freely available for research purposes and supports work on SLA, proficiency estimation, and multimodal learning analytics across pedagogy and language processing.
Abstract
Second language acquisition (SLA) is a complex and dynamic process. SLA studies that have attempted to record and analyze this process have typically focused on a single modality (e.g., textual output of learners), covered only a short period of time, and/or lacked control (e.g., failed to capture every aspect of the learning process). In Project MOSLA (Moments of Second Language Acquisition), we have created a longitudinal, multimodal, multilingual, and controlled dataset by inviting participants to learn one of three target languages (Arabic, Spanish, and Chinese) from scratch over a span of two years, exclusively through online instruction, and recording every lesson using Zoom. The dataset is semi-automatically annotated with speaker/language IDs and transcripts by both human annotators and fine-tuned state-of-the-art speech models. Our experiments reveal linguistic insights into learners' proficiency development over time, as well as the potential for automatically detecting the areas of focus on the screen purely from the unannotated multimodal data. Our dataset is freely available for research purposes and can serve as a valuable resource for a wide range of applications, including but not limited to SLA, proficiency assessment, language and speech processing, pedagogy, and multimodal learning analytics.
