
Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task

Yurui Dong, Ziyue Wang, Shuyun Lu, Dairu Liu, Xuechen Liu, Fuwen Luo, Peng Li, Yang Liu

Abstract

Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks. They offer limited support for temporally dependent auditory signals and for selective cross-modal integration, in which different modalities may provide complementary or interfering information; both are essential for realistic multimodal reasoning. As a result, whether models can actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. To this end, we introduce EscapeCraft-4D, a customizable 4D environment for assessing selective cross-modal perception and time awareness in Omni models. It incorporates trigger-based auditory sources, temporally transient evidence, and location-dependent cues, requiring agents to perform spatio-temporal reasoning and proactive multimodal integration under time constraints. Building on this environment, we curate a benchmark to evaluate the corresponding abilities across powerful models. Evaluation results suggest that models struggle with modality bias and reveal significant gaps in current models' ability to integrate multiple modalities under time constraints. Further in-depth analysis uncovers how multiple modalities interact and jointly influence model decisions in complex multimodal reasoning environments.

Paper Structure (48 sections, 18 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Our EscapeCraft-4D environment setup. It incorporates auditory, visual, and time-aware clues to evaluate multimodal reasoning. The system is designed to test the ability of agents to coordinate cross-modal perception, such as combining spatially grounded auditory cues with visually presented clues, and to make time-aware decisions under strict temporal constraints. The figure illustrates tasks arranged sequentially in three stages: exploration, time-aware search, and escape, with dynamic auditory cues playing a central role in guiding the decision-making process. The trigger system dynamically introduces time-aware clues that challenge the agent to act within limited timeframes, showcasing the integration of auditory and visual modalities.
  • Figure 2: Examples of our designed levels with successful runs from GPT-4o. (a) The basic game design includes both auditory and visual clues. (b) Misleading levels include distractors that play misleading auditory clues; models must distinguish the genuinely useful auditory clue from the distractor. (c) The time-aware level design includes clues that are visible only for a limited time; models must find these clues within the time limit, or they disappear.
  • Figure 3: Path density analysis for different levels in our MM-Escape4D benchmark. Exits of different scenes are aligned as described in Section \ref{sec:ana_path}. The heatmaps represent the log-normalized density ($\log(1+\text{density})$) of agent positions, with blue gradients for successful runs and red for failed ones. The dashed contours represent the spatial distribution of key items (e.g., recorders for auditory clues and boxes).
  • Figure 4: Analysis of ambient sound under the Audio and No-Audio settings. (A-B) Path density heatmaps showing trajectory concentrations, using the same spatial alignment method as in Figure \ref{fig:density}. (C) Delta distributions for sign-aligned metrics (Steps, Timer, Frechet, PathLen); red median lines below zero indicate that the audio condition performs better. (D) The winning rate of audio guidance across nine navigation metrics. Metrics where lower values are better: Steps, Time, PathLen, Turn, and Frechet. Efficiency metrics where higher values are better: ProgEff, PathEff, and Mono. (E) Mean distance-to-exit over normalized episode progress; the shaded area represents $\pm 1$ standard deviation. (F) Kaplan-Meier-like survival plot showing the percentage of models that have not yet reached the 1.5 m exit threshold. Statistical significance for AUC and endpoint success is determined via paired Wilcoxon tests ($*p < 0.05$ indicates a statistically significant difference).
  • Figure 5: Paired HII comparison under auditory conditioning for Difficulty-2 and Difficulty-3. Each row corresponds to one difficulty level. Left column: unpaired success vs. fail comparison in audio-active segments. Middle and right columns: paired audio vs. non-audio comparison within success and fail runs. Each dot is one run-level $\mathrm{HII}_{\mathrm{norm}}$; lines connect paired points from the same run; error bars show mean $\pm$ standard deviation. The figure shows a consistent increase in concentration during audio-active periods, while the success vs. fail separation in audio periods is significant in Difficulty-2 but not in Difficulty-3.
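
The Figure 3 and Figure 4 captions mention two quantities that are easy to misread: the log-normalized path density $\log(1+\text{density})$ and the share of runs that have not yet come within 1.5 m of the exit. The following is a minimal sketch of how such quantities could be computed, not the authors' implementation. The function names, the square grid, the use of raw visit counts as "density", and the per-step (rather than normalized-progress) time axis are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def log_density_grid(
    positions: Sequence[Tuple[float, float]],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    bins: int,
) -> List[List[float]]:
    """Bin (x, y) agent positions on a bins x bins grid, then apply log(1 + count).

    Assumption: "density" is taken to be the raw visit count per cell.
    """
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = [[0] * bins for _ in range(bins)]
    for x, y in positions:
        # Skip positions outside the plotted area.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Clamp so points on the upper edge fall into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[row][col] += 1
    return [[math.log1p(c) for c in row] for row in counts]


def not_yet_reached_curve(
    distances_per_run: Sequence[Sequence[float]],
    threshold: float = 1.5,
) -> List[float]:
    """Fraction of runs whose distance-to-exit has never dropped to the threshold.

    Assumption: each run is a list of per-step distances to the exit in meters;
    a run that ends early without reaching the threshold still counts as not reached.
    """
    horizon = max(len(d) for d in distances_per_run)
    curve = []
    for t in range(horizon):
        remaining = sum(
            1
            for dists in distances_per_run
            if not any(d <= threshold for d in dists[: t + 1])
        )
        curve.append(remaining / len(distances_per_run))
    return curve


# Hypothetical toy data, used only to show the call pattern.
toy_grid = log_density_grid(
    [(0.5, 0.5), (0.5, 0.5), (1.5, 0.5)], (0.0, 2.0), (0.0, 2.0), bins=2
)
toy_curve = not_yet_reached_curve([[3.0, 2.0, 1.0], [3.0, 2.5, 2.0]])
```

The $\log(1+\cdot)$ transform keeps empty cells at zero while compressing heavily visited cells, which is why sparse trajectories remain visible next to dense ones in a heatmap.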