HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding

Yueqian Lin, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Hai "Helen" Li, Yiran Chen

TL;DR

HippoMM is a hippocampus-inspired architecture for long-form multimodal memory that turns pattern separation, pattern completion, memory consolidation, and cross-modal retrieval into an algorithmic framework. It forms episodic memories from continuous audiovisual streams, consolidates them into semantic summaries, and answers complex queries through a hierarchical retrieval pipeline. On the HippoVlog benchmark, HippoMM reaches 78.2% accuracy with a 20.4s response time, compared with 64.2% and 112.5s for the state-of-the-art approaches it is evaluated against. Ablations indicate that each memory mechanism and the retrieval strategy are needed to reach these results. The work combines neuro-inspired memory primitives with LLM-based reasoning, offering a scalable path toward human-like, memory-enabled AI systems for long-form audiovisual data.

Abstract

Comprehending extended audiovisual experiences remains a fundamental challenge for computational systems. Current approaches struggle with temporal integration and cross-modal associations that humans accomplish effortlessly through hippocampal-cortical networks. We introduce HippoMM, a biologically-inspired architecture that transforms hippocampal mechanisms into computational advantages for multimodal understanding. HippoMM implements three key innovations: (i) hippocampus-inspired pattern separation and completion specifically designed for continuous audiovisual streams, (ii) short-to-long term memory consolidation that transforms perceptual details into semantic abstractions, and (iii) cross-modal associative retrieval pathways enabling modality-crossing queries. Unlike existing retrieval systems with static indexing schemes, HippoMM dynamically forms integrated episodic representations through adaptive temporal segmentation and dual-process memory encoding. Evaluations on our challenging HippoVlog benchmark demonstrate that HippoMM significantly outperforms state-of-the-art approaches (78.2% vs. 64.2% accuracy) while providing substantially faster response times (20.4s vs. 112.5s). Our results demonstrate that translating neuroscientific memory principles into computational architectures provides a promising foundation for next-generation multimodal understanding systems. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.

Paper Structure

This paper contains 56 sections, 15 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Conceptual overview of hippocampal versus HippoMM multimodal processing. (Top) Biological hippocampus integrates visual and auditory modalities through the entorhinal cortex and associated circuits (DG, CA3, CA1) to form and recall episodic memories. (Bottom) Proposed HippoMM architecture processes multimodal inputs (video, audio, cross-modal) inspired by hippocampal principles for episodic memory formation and cue-driven retrieval.
  • Figure 2: The HippoMM architecture for multimodal memory. (a) Memory Formation: composed of Temporal Pattern Separation ($\mathcal{S}_t$) based on perceptual boundaries, Perceptual Encoding of visual and auditory inputs with cross-modal features, Memory Consolidation using similarity-based filtering ($K$), and Semantic Replay generating ThetaEvent representations ($\theta$). (b) Memory Retrieval: implementing Query-Driven Pattern Completion through fast retrieval ($\Phi_{\text{fast}}$) and detailed recall pathways with temporal window localization ($\mathbf{W}$).
  • Figure 3: Visualization of consolidated ThetaEvent embedding space and query retrieval via t-SNE projection. (Left) Embeddings for visual features, auditory features, captions, and transcriptions in one event. The semantic summary ($\mathbf{S}_\theta$, green star) is central. (Right) Text query embeddings ('dorm', 'football' - stars) retrieve closest caption/transcription embeddings (crosses), linking queries to specific multimodal segments (corresponding frames and text shown).
  • Figure 4: Retrieval pathway analysis; the x-axis shows response time. Blue markers ($<30$s) and purple markers ($\geq 30$s) represent the fast and detailed pathways, respectively; filled markers indicate correct outcomes and open markers incorrect ones. The dashed line marks the 30s threshold.
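
The two stages named in the Figure 2 caption (memory formation, then query-driven retrieval with a fast pathway and a detailed fallback) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`separate`, `consolidate`, `form_memory`, `retrieve`, `Event`) and every threshold (`boundary_sim`, `dup_sim`, `fast_conf`) is a hypothetical choice made for illustration. The sketch assumes frames are already embedded as plain vectors, uses cosine similarity for both boundary detection and filtering, and substitutes a mean vector for the semantic summary that the paper produces during Semantic Replay.

```python
from dataclasses import dataclass
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def separate(frames, boundary_sim):
    """Temporal separation: start a new segment when consecutive frames differ sharply."""
    if not frames:
        return []
    segments = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        if cosine(prev, cur) < boundary_sim:  # assumed boundary criterion
            segments.append([])
        segments[-1].append(cur)
    return segments


def consolidate(segment, dup_sim):
    """Similarity-based filtering: drop frames nearly identical to the last kept frame."""
    kept = [segment[0]]
    for frame in segment[1:]:
        if cosine(kept[-1], frame) < dup_sim:
            kept.append(frame)
    return kept


@dataclass
class Event:
    summary: list  # stand-in for a semantic summary embedding (here: a mean vector)
    details: list  # consolidated frame embeddings kept for detailed recall


def form_memory(frames, boundary_sim=0.5, dup_sim=0.999):
    """Build one Event per segment: separate, consolidate, then summarize."""
    events = []
    for segment in separate(frames, boundary_sim):
        kept = consolidate(segment, dup_sim)
        summary = [sum(col) / len(kept) for col in zip(*kept)]
        events.append(Event(summary=summary, details=kept))
    return events


def retrieve(query, events, fast_conf=0.9):
    """Fast path on summaries; fall back to scanning details when confidence is low.

    Assumes `events` is non-empty.
    """
    idx = range(len(events))
    best = max(idx, key=lambda i: cosine(query, events[i].summary))
    if cosine(query, events[best].summary) >= fast_conf:
        return "fast", best
    best = max(idx, key=lambda i: max(cosine(query, d) for d in events[i].details))
    return "detailed", best


if __name__ == "__main__":
    stream = [[1.0, 0.0], [1.0, 0.01], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0]]
    memory = form_memory(stream)
    print(len(memory), "events")
    print(retrieve([1.0, 0.0], memory))
    print(retrieve([0.9, 0.1], memory, fast_conf=0.9999))
```

In this toy version the fallback is triggered by a similarity score on the summaries. The 30s line in Figure 4 is a response-time threshold used to label pathways in the plot, so the two should not be read as the same quantity.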