M-CALLM: Multi-level Context Aware LLM Framework for Group Interaction Prediction
Diana Romero, Xin Gao, Daniel Khalkhali, Salma Elmalaki
TL;DR
This work introduces M-CALLM, a multi-level context-aware LLM framework that converts multimodal MR sensor data into hierarchical natural language context to predict near-future group interactions. By encoding individual traits, group structure, and temporal dynamics as natural language prompts, the system enables Gemma-2B to forecast conversation sociograms in real time with up to 96% structural similarity, at sub-35 ms latency, outperforming LSTM baselines by about 3.2x. However, the approach has modality-dependent limits (shared attention remains at 0% recall) and is brittle under autoregressive simulation, where errors in the predicted context cascade into later predictions. These results expose a gap in semantic reasoning and a need for greater robustness. The findings advocate a tiered sensor strategy and hybrid architectures that balance semantic capabilities with error buffering, offering practical guidance for deploying intelligent collaborative sensing in MR while acknowledging privacy and generalization considerations. Overall, the paper demonstrates the potential of semantic-context LLMs for group collaboration analytics and outlines key directions for strengthening resilience in long-horizon predictions.
Abstract
This paper explores how large language models can leverage multi-level contextual information to predict group coordination patterns in collaborative mixed reality environments. We demonstrate that encoding individual behavioral profiles, group structural properties, and temporal dynamics as natural language enables LLMs to break through the performance ceiling of statistical models. We build M-CALLM, a framework that transforms multimodal sensor streams into hierarchical context for LLM-based prediction, and evaluate three paradigms (zero-shot prompting, few-shot learning, and supervised fine-tuning) against statistical baselines across intervention mode (real-time prediction) and simulation mode (autoregressive forecasting). Head-to-head comparison on 16 groups (64 participants, ~25 hours) demonstrates that context-aware LLMs achieve 96% accuracy for conversation prediction, a 3.2x improvement over LSTM baselines, while maintaining sub-35 ms latency. However, simulation mode reveals brittleness, with 83% degradation due to cascading errors. A deep dive into modality-specific performance shows that conversation depends on temporal patterns and proximity benefits from group structure (+6%), while shared attention fails completely (0% recall), exposing architectural limitations. We hope this work spawns new ideas for building intelligent collaborative sensing systems that balance semantic reasoning capabilities with fundamental constraints.
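To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical `Context` record with one field per context level, a hypothetical prompt layout in `build_prompt`, and a caller-supplied `predict` function standing in for the LLM. The `rollout` helper contrasts the two evaluation modes: in intervention mode the history is extended with observed interactions, while in simulation mode it is extended with the model's own predictions, which is how a single early mistake can propagate into every later prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Context:
    """The three context levels named in the abstract (field contents are illustrative)."""
    individual: Dict[str, str]  # per-participant behavioral profile
    group: str                  # group structural properties
    history: List[str]          # temporal dynamics: one description per past window


def build_prompt(ctx: Context, modality: str) -> str:
    """Render the hierarchical context as a natural-language prompt (assumed layout)."""
    lines = ["Individual profiles:"]
    lines += [f"- {pid}: {profile}" for pid, profile in sorted(ctx.individual.items())]
    lines += ["Group structure:", f"- {ctx.group}", "Recent windows (oldest first):"]
    lines += [f"- t-{len(ctx.history) - i}: {desc}" for i, desc in enumerate(ctx.history)]
    lines.append(f"Predict the {modality} sociogram for the next window.")
    return "\n".join(lines)


def rollout(
    ctx: Context,
    modality: str,
    predict: Callable[[str], str],
    steps: int,
    observed: Optional[List[str]] = None,
) -> List[str]:
    """Predict `steps` windows ahead.

    With `observed` given (intervention mode), each step's history is extended
    with the ground-truth window. Without it (simulation mode), the history is
    extended with the model's own prediction, so errors feed forward.
    """
    history = list(ctx.history)
    predictions = []
    for step in range(steps):
        prompt = build_prompt(Context(ctx.individual, ctx.group, history), modality)
        pred = predict(prompt)
        predictions.append(pred)
        history.append(observed[step] if observed is not None else pred)
    return predictions
```

In use, `predict` would wrap a call to whichever model is being evaluated; passing a stub that always returns the same string is enough to see that, in simulation mode, that string reappears in the next prompt's history.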
