LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, Min Xia

TL;DR

LLaVA-MR addresses moment retrieval in long videos by strengthening the temporal perception of multimodal LLMs with three components: Dense Frame and Time Encoding (DFTE), Informative Frame Selection (IFS), and Dynamic Token Compression (DTC). A frozen visual encoder and Q-Former extract dense spatiotemporal features, which are fed to an LLM that generates open-ended moment sequences. The model is trained with LoRA and evaluated on Charades-STA, QVHighlights, and ActivityNet Captions, where it achieves state-of-the-art results and scales better to long videos. Key contributions include a unified architecture for end-to-end moment localization, ablations validating the benefit of each component, and qualitative and cross-dataset analyses. The work improves temporal precision, reduces sequence length, and generalizes across diverse video domains; an open-source release is planned.

Abstract

Multimodal Large Language Models (MLLMs) are widely used for visual perception, understanding, and reasoning. However, long video processing and precise moment retrieval remain challenging due to LLMs' limited context size and coarse frame extraction. We propose the Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR), which enables accurate moment retrieval and contextual grounding in videos using MLLMs. LLaVA-MR combines Dense Frame and Time Encoding (DFTE) for spatial-temporal feature extraction, Informative Frame Selection (IFS) for capturing brief visual and motion patterns, and Dynamic Token Compression (DTC) to manage LLM context limitations. Evaluations on benchmarks like Charades-STA and QVHighlights demonstrate that LLaVA-MR outperforms 11 state-of-the-art methods, achieving an improvement of 1.82% in R1@0.5 and 1.29% in mAP@0.5 on the QVHighlights dataset. Our implementation will be open-sourced upon acceptance.
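For readers unfamiliar with the R1@0.5 metric quoted above: it is the fraction of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU of at least 0.5. The following is a minimal sketch of that computation, not the authors' evaluation code; the function names and the toy intervals are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals, both in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Toy data (hypothetical): one prediction per query, one ground truth per query.
preds = [(2.0, 8.0), (10.0, 12.0)]
gts = [(3.0, 9.0), (20.0, 25.0)]

r1_at_05 = recall_at_1(preds, gts, threshold=0.5)
```

In this toy case the first prediction overlaps its ground truth with IoU 5/7 and the second does not overlap at all, so half of the queries count as hits.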

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Advantages of LLaVA-MR compared with prior work. (a) Traditional transformer-based methods. (b) Previous MLLM-based methods. (c) Our LLaVA-MR.
  • Figure 2: LLaVA-MR model overview. We leverage a pretrained MLLM such as BLIP-2 (Li et al., 2023). Our model primarily consists of Dense Frame and Time Encoding (DFTE), Informative Frame Selection (IFS), and Dynamic Token Compression (DTC).
  • Figure 3: Two compression methods in DTC. (a) Average Pooling reduces tokens by averaging feature values within a window of 2, where $f_i^q \in N^q$. (b) Variance-Based DTC calculates the variance of each Q-Former query across all frames (Q is the number of queries), sorts the queries by variance, and keeps the top half, so that the retained tokens focus on dynamic content.
  • Figure 4: Qualitative results on the Charades-STA and QVHighlights datasets, with ground truth segments for query events alongside highlighted predicted intervals.
  • Figure 5: Temporal correspondence between video frames and adjacent frame feature distance $\hat{\mathbf{d}}$.
  • ...and 2 more figures
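The two compression methods described in the Figure 3 caption can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: tokens are stored as nested lists indexed `[frame][query][feature]`, average pooling is assumed to merge adjacent query tokens within a frame, and a query's variance is assumed to be the sum of its per-dimension variances across frames. The function names and toy values are hypothetical.

```python
from statistics import pvariance


def average_pool_queries(frames, window=2):
    """Average each run of `window` adjacent query tokens within a frame.

    Assumption: pooling runs along the query axis; the caption does not
    state the axis explicitly.
    """
    pooled = []
    for frame in frames:
        out = []
        for i in range(0, len(frame) - window + 1, window):
            group = frame[i:i + window]
            out.append([sum(vals) / window for vals in zip(*group)])
        pooled.append(out)
    return pooled


def variance_select_queries(frames, keep_ratio=0.5):
    """Keep the queries whose tokens vary most across frames.

    Assumption: a query's score is the sum over feature dimensions of the
    population variance across frames.
    """
    num_queries = len(frames[0])
    dim = len(frames[0][0])
    scores = [
        sum(pvariance([frame[q][d] for frame in frames]) for d in range(dim))
        for q in range(num_queries)
    ]
    keep = max(1, int(num_queries * keep_ratio))
    top = sorted(range(num_queries), key=lambda q: scores[q], reverse=True)[:keep]
    top.sort()  # restore the original query order among the kept queries
    return [[frame[q] for q in top] for frame in frames], top


# Toy input: 3 frames, 4 queries per frame, 2 features per token.
# Queries 0 and 2 are static; queries 1 and 3 change from frame to frame.
frames = [
    [[1.0, 1.0], [0.0, 0.0], [5.0, 5.0], [2.0, 2.0]],
    [[1.0, 1.0], [4.0, 0.0], [5.0, 5.0], [2.0, 4.0]],
    [[1.0, 1.0], [8.0, 0.0], [5.0, 5.0], [2.0, 0.0]],
]

pooled = average_pool_queries(frames)
selected, kept = variance_select_queries(frames)
```

Both functions halve the number of tokens per frame. Average pooling blends neighbouring tokens regardless of content, whereas the variance-based variant discards the static queries (0 and 2 in the toy input) and keeps the two that change over time.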