HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
Yueqian Lin, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Hai "Helen" Li, Yiran Chen
TL;DR
HippoMM introduces a hippocampus-inspired architecture for long-form multimodal memory, translating pattern separation, pattern completion, memory consolidation, and cross-modal retrieval into an algorithmic framework. It forms episodic memories from continuous audiovisual streams, consolidates them into semantic summaries, and uses a hierarchical retrieval pipeline to answer complex queries efficiently. On the HippoVlog benchmark, HippoMM reaches 78.2% accuracy (vs. 64.2% for state-of-the-art approaches) with a response time of 20.4s (vs. 112.5s), and ablations indicate that each memory mechanism and the retrieval strategy are necessary to that result. The work advances multimodal understanding by combining neuro-inspired memory primitives with modern LLM-based reasoning, offering a scalable path toward AI systems with human-like memory for long-form audiovisual data.
Abstract
Comprehending extended audiovisual experiences remains a fundamental challenge for computational systems. Current approaches struggle with temporal integration and cross-modal associations that humans accomplish effortlessly through hippocampal-cortical networks. We introduce HippoMM, a biologically inspired architecture that transforms hippocampal mechanisms into computational advantages for multimodal understanding. HippoMM implements three key innovations: (i) hippocampus-inspired pattern separation and completion specifically designed for continuous audiovisual streams, (ii) short-to-long-term memory consolidation that transforms perceptual details into semantic abstractions, and (iii) cross-modal associative retrieval pathways enabling modality-crossing queries. Unlike existing retrieval systems with static indexing schemes, HippoMM dynamically forms integrated episodic representations through adaptive temporal segmentation and dual-process memory encoding. Evaluations on our challenging HippoVlog benchmark show that HippoMM significantly outperforms state-of-the-art approaches (78.2% vs. 64.2% accuracy) while providing substantially faster response times (20.4s vs. 112.5s). Our results demonstrate that translating neuroscientific memory principles into computational architectures provides a promising foundation for next-generation multimodal understanding systems. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.
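To make the three mechanisms named in the abstract more concrete, the following is a minimal illustrative sketch and not the HippoMM implementation. It assumes each frame is already represented by an embedding vector, and it stands in simple rules for the real components: a cosine-similarity threshold for adaptive temporal segmentation, mean pooling for consolidation into a summary, and a coarse-to-fine lookup for hierarchical retrieval. The function names (`segment_stream`, `consolidate`, `retrieve`) and the `threshold` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_stream(frames: List[Vector], threshold: float = 0.8) -> List[List[int]]:
    """Assumed stand-in for adaptive temporal segmentation.

    A new segment starts whenever a frame's similarity to the previous
    frame falls below `threshold`. Returns lists of frame indices.
    """
    if not frames:
        return []
    segments = [[0]]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            segments.append([i])      # content changed: open a new episode
        else:
            segments[-1].append(i)    # content similar: extend current episode
    return segments


def consolidate(frames: List[Vector], segment: List[int]) -> List[float]:
    """Assumed stand-in for consolidation: mean embedding of one segment."""
    dim = len(frames[segment[0]])
    return [sum(frames[i][d] for i in segment) / len(segment) for d in range(dim)]


def retrieve(query: Vector, frames: List[Vector],
             segments: List[List[int]]) -> Tuple[int, int]:
    """Coarse-to-fine lookup: best segment by summary, then best frame in it."""
    summaries = [consolidate(frames, seg) for seg in segments]
    best_seg = max(range(len(segments)), key=lambda k: cosine(query, summaries[k]))
    best_frame = max(segments[best_seg], key=lambda i: cosine(query, frames[i]))
    return best_seg, best_frame


# Toy stream: two frames of one "scene" followed by two frames of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = segment_stream(frames)
result = retrieve([0.0, 1.0], frames, segments)
```

In this toy stream the similarity drops sharply between the second and third frames, so the sketch produces two segments, and the query is matched first against the two segment summaries before any individual frame is compared. The system described in the abstract operates on audiovisual input and relies on LLM-based reasoning, neither of which this sketch attempts to model.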
