Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui

TL;DR

This work tackles the challenge of long-range temporal reasoning in multimodal foundation models by introducing Temporal Working Memory (TWM), a plug-in cognitive module that uses query-guided attention to retain only task-relevant video and audio segments. TWM maintains separate visual and auditory memory buffers and employs search-update cycles with multi-scale temporal attention, guided by a language query, to filter out noise and preserve informative content. Integrated with nine state-of-the-art MFMs and evaluated on AVQA, video captioning, and video-text retrieval across MUSIC-AVQA, MSR-VTT, and CMD, TWM yields consistent gains in cross-modal reasoning, temporal coherence, and retrieval performance. The approach demonstrates that selective, query-driven memory management can significantly extend the temporal reasoning capabilities of MFMs, enabling more robust analysis of complex, time-sensitive multimedia data.

Abstract

Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.
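The query-guided retention idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each segment and the query are already embedded in a shared vector space, scores segments by cosine similarity to the query, and keeps the top k in temporal order. The names `cosine`, `select_segments`, and the toy embeddings are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_segments(query_emb, segment_embs, k):
    """Return the indices of the k segments most similar to the query.

    Indices are returned in ascending (temporal) order so that the retained
    segments keep their original sequence.
    """
    scores = [cosine(query_emb, seg) for seg in segment_embs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy example: four segment embeddings and one query embedding (all hypothetical).
query = [1.0, 0.0]
segments = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
kept = select_segments(query, segments, k=2)
print(kept)
```

In this toy setting the memory buffer would retain only the segments at the returned indices and discard the rest. The paper's search-update cycles and multi-scale temporal attention are more involved than a single top-k pass.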

Paper Structure

This paper contains 38 sections, 3 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: Temporal Working Memory (TWM): TWM employs search engine and memory refresh mechanisms to retain key segments in long multimodal inputs.
  • Figure 2: The temporal working memory (TWM) pipeline retains the most relevant segments from video and audio inputs based on a language query. The Language Encoder processes the query, guiding the Search Engine to identify and select key video and audio segments. TWM ensures the retention of only the most informative data, enabling the efficient utilization of multimodal foundation models' capabilities.
  • Figure 3: Aligning frames with language query. A linear projection layer trained with InfoNCE loss aligns visual embeddings with query-based anchors.
  • Figure 4: Similarity search for query-relevant audio segments. The audio encoder utilizes visual embeddings as a query to search for the most relevant audio segments, updating the auditory buffer to retain only the essential audio information.
  • Figure 5: Audio segments aligned with query-relevant frames. An audio encoder learns temporal distance and resolution for audio-visual embedding alignment.
  • ...and 8 more figures
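The Figure 3 caption mentions a projection layer trained with InfoNCE loss. As a rough illustration of that loss only, the sketch below computes the standard InfoNCE objective for one anchor against a set of candidates, one of which is the positive. The L2 normalization, the temperature value, and the names `info_nce` and `l2_normalize` are assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def info_nce(anchor, candidates, positive_index, temperature=0.07):
    """InfoNCE loss for a single anchor.

    loss = -log( exp(s_pos / t) / sum_j exp(s_j / t) ),
    where s_j is the cosine similarity between the anchor and candidate j.
    """
    a = l2_normalize(anchor)
    logits = []
    for cand in candidates:
        c = l2_normalize(cand)
        logits.append(sum(x * y for x, y in zip(a, c)) / temperature)
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]


# Hypothetical query anchor and two frame embeddings; frame 0 matches the query.
anchor = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(anchor, frames, positive_index=0, temperature=1.0)
loss_mismatched = info_nce(anchor, frames, positive_index=1, temperature=1.0)
print(loss_matched, loss_mismatched)
```

The loss is lower when the designated positive is the candidate closest to the anchor. That is the signal a projection layer would be trained on to pull query-relevant frames toward the query anchor.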