Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding
Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui
TL;DR
This work tackles the challenge of long-range temporal reasoning in multimodal foundation models (MFMs) by introducing Temporal Working Memory (TWM), a plug-in cognitive module that uses query-guided attention to retain only task-relevant video and audio segments. TWM maintains separate visual and auditory memory buffers and runs search-update cycles with multi-scale temporal attention, guided by a language query, to filter out noise and preserve informative content. Integrated with nine state-of-the-art MFMs and evaluated on audio-visual question answering (AVQA), video captioning, and video-text retrieval across the MUSIC-AVQA, MSR-VTT, and CMD datasets, TWM yields consistent gains in cross-modal reasoning, temporal coherence, and retrieval performance. The approach demonstrates that selective, query-driven memory management can significantly extend the temporal reasoning capabilities of MFMs, enabling more robust analysis of complex, time-sensitive multimedia data.
Abstract
Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. TWM uses query-guided attention to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM makes better use of the model's limited capacity and thereby strengthens its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With TWM integrated, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.
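To make the idea of query-guided segment retention concrete, the following is a minimal sketch and not the implementation in the linked repository. It assumes each segment and the language query are already embedded as plain vectors, and it replaces TWM's search-update cycles and multi-scale temporal attention with a much simpler stand-in: cosine-similarity scoring followed by top-k selection. The names `cosine`, `select_segments`, and `capacity`, and the toy embeddings, are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_segments(query_vec, segment_vecs, capacity):
    """Return indices of the `capacity` segments most similar to the query.

    Indices are returned in temporal (ascending) order so the retained
    segments keep their original sequence. This is an illustrative
    stand-in for query-guided attention, not the paper's algorithm.
    """
    scored = [(cosine(query_vec, seg), idx) for idx, seg in enumerate(segment_vecs)]
    # Highest score first; ties are broken by the earlier segment.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:capacity]
    return sorted(idx for _, idx in top)


# Toy embeddings (assumed): one query, separate visual and audio streams.
query = [1.0, 0.0]
visual_segments = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [1.0, 0.0]]
audio_segments = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]

# Separate buffers per modality, each with its own limited capacity.
visual_memory = select_segments(query, visual_segments, capacity=2)
audio_memory = select_segments(query, audio_segments, capacity=1)
```

Keeping one buffer per modality mirrors the separate visual and auditory memories described in the TL;DR; the fixed `capacity` stands in for the model's limited internal budget, so only the segments most relevant to the query are passed on to the downstream MFM.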
