LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

Hanyu Zhou, Gim Hee Lee

TL;DR

The paper addresses the challenge of fine-grained spatiotemporal reasoning in large multimodal models by mitigating frame-based temporal sparsity with high-temporal-resolution event cameras. It introduces LLaFEA, a two-stage framework that fuses frame and event features through cross-attention and self-attention to produce spatiotemporal-dense visual representations, and embeds textual spatiotemporal coordinates into this space to align with language. The authors create synthetic and real frame-event datasets, implement a four-stage training pipeline, and demonstrate significant improvements in spatiotemporal grounding and related tasks, especially under high-dynamic and low-light conditions. This approach enhances the ability of LMMs to interpret scenes at any position and time, offering robust, dense spatiotemporal understanding for practical applications in dynamic environments.

Abstract

Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatiotemporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatiotemporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a real-world frame-event dataset with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
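
The fusion order described in the abstract (cross-attention between the two modalities, then self-attention over the combined tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `attention` and `fuse_frame_event`, the token counts, the residual connections, and the absence of learned projection matrices are all assumptions made to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention with q of shape (Nq, d), k and v of shape (Nk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def fuse_frame_event(frame_tokens, event_tokens):
    """Hypothetical fusion: bidirectional cross-attention, then joint self-attention.

    Learned query/key/value projections are omitted (an assumption for brevity).
    """
    # Cross-attention: each modality queries the other for complementary features.
    frame_enriched = frame_tokens + attention(frame_tokens, event_tokens, event_tokens)
    event_enriched = event_tokens + attention(event_tokens, frame_tokens, frame_tokens)
    # Self-attention over the concatenated tokens for global associations.
    tokens = np.concatenate([frame_enriched, event_enriched], axis=0)
    return tokens + attention(tokens, tokens, tokens)

rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(16, 32))  # assumed: 16 frame tokens, 32-dim features
event_tokens = rng.normal(size=(64, 32))  # assumed: 64 event tokens, same feature dim
fused = fuse_frame_event(frame_tokens, event_tokens)
print(fused.shape)  # (80, 32)
```

In a trained model each attention call would use learned projections and multiple heads; the sketch only shows the order of operations and the resulting token shapes.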

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Illustration of two LMM paradigms for fine-grained spatiotemporal understanding. This task relies on aligning linguistic and visual representations in spatiotemporal coordinates. Frame-only LMMs embed these coordinates into visual tokens from frame videos, but suffer from temporal sparsity that weakens alignment. We incorporate a temporally dense event camera to enhance spatiotemporal representation for LMMs to interpret scenes more accurately at any position and time.
  • Figure 2: Our LLaFEA comprises visual frame-event spatiotemporal fusion and language-vision coordinate alignment. The first stage fuses spatial-dense, temporal-sparse frame features with spatial-sparse, temporal-dense event features to generate spatiotemporal-dense visual tokens. The second stage embeds spatiotemporal coordinate tokens from the language embedding into the fused visual tokens, which are then processed by the LLM for fine-grained understanding.
  • Figure 3: Visual imaging and feature representation of frames and events. In visual imaging, frames capture a spatially dense and temporally sparse global appearance. Events detect spatially sparse and temporally dense local boundaries. In feature representation, spatial and temporal feature distributions exhibit strong similarities between the two modalities.
  • Figure 4: Visual comparison of LMMs on fine-grained understanding in high-dynamic scenes. Red fonts highlight incorrect results.
  • Figure 5: Visual comparison of LMMs on spatiotemporal understanding in low-light scenes. Red fonts highlight incorrect results.
  • ...and 1 more figure
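
The temporal-sparsity argument in the Figure 1 caption can be made concrete with a small numerical sketch. It measures how far an arbitrary query time lies from the nearest visual sample when time is sampled sparsely (frames) versus densely (event bins). The 30 fps frame rate, the 1 ms event bin width, the query times, and the helper name `nearest_sample_error` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nearest_sample_error(query_times, sample_times):
    """Absolute gap (seconds) between each query time and its nearest sample time."""
    q = np.asarray(query_times, dtype=float)[:, None]
    s = np.asarray(sample_times, dtype=float)[None, :]
    return np.abs(q - s).min(axis=1)

# Assumed sampling grids over a one-second clip.
frame_times = np.arange(30) / 30.0          # hypothetical 30 fps frame timestamps
event_bin_times = np.arange(1000) / 1000.0  # hypothetical 1 ms event bins

# Hypothetical moments that a text instruction might refer to (seconds).
queries = [0.0125, 0.4712, 0.9031]

frame_err = nearest_sample_error(queries, frame_times)
event_err = nearest_sample_error(queries, event_bin_times)

for t, fe, ee in zip(queries, frame_err, event_err):
    print(f"t={t:.4f}s  frame gap={fe * 1e3:.2f} ms  event gap={ee * 1e3:.2f} ms")
```

Under these assumptions the nearest frame can be several milliseconds away from the queried moment, while the nearest event bin is at most half a millisecond away. This is one way to read the caption's claim that temporally dense event data helps align a textual time reference with visual content.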