LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Hanyu Zhou, Gim Hee Lee
TL;DR
The paper addresses the challenge of fine-grained spatiotemporal reasoning in large multimodal models by mitigating frame-based temporal sparsity with high-temporal-resolution event cameras. It introduces LLaFEA, a two-stage framework that fuses frame and event features through cross-attention and self-attention to produce spatiotemporal-dense visual representations, and embeds textual spatiotemporal coordinates into this space to align with language. The authors create synthetic and real frame-event datasets, implement a four-stage training pipeline, and demonstrate significant improvements in spatiotemporal grounding and related tasks, especially under high-dynamic and low-light conditions. This approach enhances the ability of LMMs to interpret scenes at any position and time, offering robust, dense spatiotemporal understanding for practical applications in dynamic environments.
Abstract
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant), which leverages event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatiotemporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatiotemporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a real-world frame-event dataset with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
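The fusion order described in the abstract (cross-attention between frame and event features, then self-attention over the result) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the residual connections, the token counts, the feature width, and the absence of learned projection weights are all assumptions made to keep the example short and runnable.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention without learned projections (assumption).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def fuse_frame_event(frame_tokens, event_tokens):
    # Hypothetical helper: cross-attention in both directions, so that
    # spatially dense frame tokens pick up temporal cues from events and
    # temporally dense event tokens pick up appearance cues from frames.
    frame_enriched = frame_tokens + attention(frame_tokens, event_tokens, event_tokens)
    event_enriched = event_tokens + attention(event_tokens, frame_tokens, frame_tokens)
    # Self-attention over the concatenated tokens for global association.
    fused = np.concatenate([frame_enriched, event_enriched], axis=0)
    return fused + attention(fused, fused, fused)

rng = np.random.default_rng(0)
frame_tokens = rng.standard_normal((4, 8))    # few tokens: sparse in time
event_tokens = rng.standard_normal((16, 8))   # many tokens: dense in time
fused = fuse_frame_event(frame_tokens, event_tokens)
print(fused.shape)  # (20, 8)
```

In a trained model, each attention step would use learned query, key, and value projections, and the textual position and duration tokens would be embedded into the same feature space as `fused`; both are omitted here.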
