Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

Clayton Cohn, Eduardo Davalos, Caleb Vatral, Joyce Horn Fonteles, Hanchen David Wang, Austin Coursey, Surya Rayala, Ashwin T S, Meiyi Ma, Gautam Biswas

TL;DR

The paper conducts a systematic review of applied multimodal learning analytics (MMLA) in authentic learning and training environments, revealing how real-world constraints shape data collection, fusion, and analysis. It introduces a four-component framework—Environment, Multimodal Data, Learning Analytics, and Feedback—and a five-modality taxonomy to structure empirical methods, with a clear post-LLM shift toward GenAI-enabled analysis and interpretive support. Three archetypes (Designing and Developing Methods, Analyzing Outcomes, Exploring Behaviors) anchor the synthesis, each illustrated by case studies that demonstrate the practical configuration of sensing, fusion, and analytics. The study highlights enduring challenges in data quality, cross-modal alignment, and interpretability, while outlining future directions for active, longitudinal, and standardized MMLA with agentic capabilities to improve educational impact. Overall, the framework and insights aim to guide researchers and practitioners in designing scalable, explanation-rich, and pedagogy-aligned multimodal learning systems.

Abstract

Recent technological advancements in multimodal machine learning, including the rise of large language models (LLMs), have improved our ability to collect, process, and analyze diverse multimodal data such as speech, video, and eye gaze in learning and training contexts. While prior reviews have addressed individual components of the multimodal pipeline (e.g., conceptual models, data fusion), a comprehensive review of empirical methods in applied multimodal environments remains notably absent. This review addresses that gap, introducing a taxonomy and framework that capture both established practices and recent innovations driven by LLMs and generative AI. We identify five modality groups: Natural Language, Vision, Physiological Signals, Human-Centered Evidence, and Environment Logs. Our analysis shows that integrating modalities enables richer insights into learner and trainee behaviors, revealing latent patterns often overlooked by unimodal approaches. However, persistent challenges in multimodal data collection and integration continue to hinder the adoption of these systems in real-time classroom settings.

Paper Structure

This paper contains 68 sections, 9 figures, 13 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Distribution of full corpus papers by year. Blue bars represent works published prior to the release of ChatGPT in November 2022 (Corpus A); orange bars represent those published afterward (Corpus B). Number of Papers refers to the number of papers selected for this review.
  • Figure 2: Multimodal Learning and Training Environments Literature Review Framework
  • Figure 3: Learning-Training Continuum
  • Figure 4: Corpus A data collection media distribution.
  • Figure 5: Corpus B data collection media distribution.
  • ...and 4 more figures