M-CALLM: Multi-level Context Aware LLM Framework for Group Interaction Prediction

Diana Romero, Xin Gao, Daniel Khalkhali, Salma Elmalaki

TL;DR

This work introduces M-CALLM, a multi-level context-aware LLM framework that converts multimodal mixed reality (MR) sensor data into hierarchical natural language context to predict near-future group interactions. By encoding individual traits, group structure, and temporal dynamics as natural language prompts, the system enables Gemma-2B to forecast sociograms with up to $96\%$ structural similarity for conversation in real time, achieving sub-$35$ ms latency and outperforming LSTM baselines by a factor of about $3.2\times$. However, the approach exhibits modality-dependent limits (shared attention remains at $0\%$ recall) and brittle performance under autoregressive simulation due to cascading context errors, highlighting the semantic reasoning gap and the need for robustness. The findings advocate a tiered sensor strategy and hybrid architectures that balance semantic capabilities with error-buffering, offering practical guidance for deploying intelligent collaborative sensing in MR while acknowledging privacy and generalization considerations. Overall, the paper demonstrates the potential of semantic-context LLMs for group collaboration analytics and outlines key directions for strengthening resilience in long-horizon predictions.

Abstract

This paper explores how large language models can leverage multi-level contextual information to predict group coordination patterns in collaborative mixed reality environments. We demonstrate that encoding individual behavioral profiles, group structural properties, and temporal dynamics as natural language enables LLMs to break through the performance ceiling of statistical models. We build M-CALLM, a framework that transforms multimodal sensor streams into hierarchical context for LLM-based prediction, and evaluate three paradigms (zero-shot prompting, few-shot learning, and supervised fine-tuning) against statistical baselines across intervention mode (real-time prediction) and simulation mode (autoregressive forecasting). A head-to-head comparison on 16 groups (64 participants, ~25 hours) demonstrates that context-aware LLMs achieve 96% accuracy for conversation prediction, a 3.2x improvement over LSTM baselines, while maintaining sub-35 ms latency. However, simulation mode reveals brittleness, with 83% degradation due to cascading errors. A deep dive into modality-specific performance shows that conversation depends on temporal patterns and proximity benefits from group structure (+6%), while shared attention fails completely (0% recall), exposing architectural limitations. We hope this work spawns new ideas for building intelligent collaborative sensing systems that balance semantic reasoning capabilities with fundamental constraints.
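
The difference between intervention mode and simulation mode can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names, the `predict` callable, and the scalar toy "sociograms" are all assumptions made for illustration. The only idea taken from the abstract is that real-time prediction conditions on observed windows, while autoregressive forecasting feeds the model's own predictions back as context, so errors can accumulate.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # one time window's state (hypothetical stand-in for a sociogram)


def intervention_rollout(
    predict: Callable[[Sequence[S]], S], truth: Sequence[S], history_len: int
) -> List[S]:
    """Real-time style: each prediction is conditioned on observed windows only."""
    preds = []
    for t in range(history_len, len(truth)):
        preds.append(predict(truth[t - history_len:t]))
    return preds


def simulation_rollout(
    predict: Callable[[Sequence[S]], S], truth: Sequence[S], history_len: int
) -> List[S]:
    """Autoregressive style: after the seed windows, predictions become context."""
    context = list(truth[:history_len])  # only the seed comes from observations
    preds = []
    for _ in range(history_len, len(truth)):
        nxt = predict(context[-history_len:])
        preds.append(nxt)
        context.append(nxt)  # the model's own output is fed back in
    return preds


# Toy setup (assumed): the true signal grows by 1 per window, while the
# hypothetical predictor systematically under-shoots by 0.5.
truth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
biased_predict = lambda history: history[-1] + 0.5
history_len = 2

targets = truth[history_len:]
intervention_errors = [
    abs(p - y) for p, y in zip(intervention_rollout(biased_predict, truth, history_len), targets)
]
simulation_errors = [
    abs(p - y) for p, y in zip(simulation_rollout(biased_predict, truth, history_len), targets)
]

print(intervention_errors)  # [0.5, 0.5, 0.5, 0.5]
print(simulation_errors)    # [0.5, 1.0, 1.5, 2.0]
```

In this toy case the per-step error stays constant when observed windows are available and grows at every step once predictions are fed back, which is the kind of cascading behavior the abstract attributes to simulation mode.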

Paper Structure

This paper contains 37 sections, 1 equation, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Context-aware LLM system architecture. Multimodal MR sensor data (gaze, audio, location, task state) are processed into sociograms representing conversation, proximity, and shared attention networks. Hierarchical context including individual behavioral profiles, group structural properties, and temporal dynamics is encoded as natural language and provided to Gemma-2B for next-window prediction via zero-shot, few-shot ($k=1$), or fine-tuned (LoRA) approaches. The system achieves sub-35ms TTFT with 7.6x richer context than minimal baselines, enabling real-time deployment on consumer hardware.
  • Figure 2: Prompt structure for group interaction prediction. Key components include participant behavioral profiles, group structural metrics, network metrics with PCA weights, temporal phase stability, recent pairwise history, few-shot examples (highlighted in green), and task instructions. The example section (green) is omitted in the zero-shot condition.
  • Figure 3: Statistical model context plateau. LSTM performance remains constant ($\sim$29% sociogram similarity) across four context configurations: individual-only, individual+group, individual+temporal, and full multi-level context. While sensor-level accuracy improves with richer context (left), sociogram similarity does not (right), revealing statistical models' inability to capture emergent coordination patterns.
  • Figure 4: LSTM weighted Jaccard across context configurations. Performance plateaus around 29% regardless of context complexity. The flat trend demonstrates architectural limitations: statistical models cannot leverage rich contextual information to improve group coordination prediction.
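
The Figure 4 caption reports a weighted Jaccard score between predicted and observed sociograms. The paper's exact formulation is not shown in this excerpt, so the sketch below assumes the standard weighted Jaccard (Ruzicka) similarity: the sum of edge-wise minimum weights divided by the sum of edge-wise maximum weights. The edge-dictionary representation and the function name are hypothetical.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # assumed: participant pair with a canonical ordering


def weighted_jaccard(pred: Dict[Edge, float], true: Dict[Edge, float]) -> float:
    """Standard weighted Jaccard over non-negative edge weights.

    Edges missing from one sociogram are treated as weight 0.
    Two empty sociograms are treated as identical (similarity 1.0).
    """
    edges = set(pred) | set(true)
    numerator = sum(min(pred.get(e, 0.0), true.get(e, 0.0)) for e in edges)
    denominator = sum(max(pred.get(e, 0.0), true.get(e, 0.0)) for e in edges)
    return 1.0 if denominator == 0 else numerator / denominator


# Toy sociograms (made-up weights, e.g. interaction counts in one window).
observed = {("A", "B"): 3.0, ("B", "C"): 1.0}
predicted = {("A", "B"): 2.0, ("A", "C"): 1.0}

score = weighted_jaccard(predicted, observed)
print(score)  # min-sum = 2, max-sum = 5, so 0.4
```

Under this definition a prediction scores 1.0 only when every edge weight matches, and each missing or spurious edge adds to the denominator without adding to the numerator.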