LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

Weiheng Lu; Jian Li; An Yu; Ming-Ching Chang; Shengpeng Ji; Min Xia

TL;DR

LLaVA-MR targets video moment retrieval in long videos by improving the temporal perception of multimodal LLMs through three components: Dense Frame and Time Encoding (DFTE), Informative Frame Selection (IFS), and Dynamic Token Compression (DTC). Dense spatiotemporal features are extracted with a frozen visual encoder and a Q-Former and fed to an LLM, which generates open-ended moment sequences. The model is trained with LoRA and evaluated on Charades-STA, QVHighlights, and ActivityNet Captions, where it achieves state-of-the-art results and scales better to long videos. Key contributions include a unified architecture for end-to-end moment localization, ablations validating the benefit of each component, and qualitative and cross-dataset analyses. Together, the components improve temporal precision, reduce the sequence length passed to the LLM, and support generalization across diverse video domains. An open-source release is planned.

Abstract

Multimodal Large Language Models (MLLMs) are widely used for visual perception, understanding, and reasoning. However, long video processing and precise moment retrieval remain challenging due to LLMs' limited context size and coarse frame extraction. We propose the Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR), which enables accurate moment retrieval and contextual grounding in videos using MLLMs. LLaVA-MR combines Dense Frame and Time Encoding (DFTE) for spatial-temporal feature extraction, Informative Frame Selection (IFS) for capturing brief visual and motion patterns, and Dynamic Token Compression (DTC) to manage LLM context limitations. Evaluations on benchmarks like Charades-STA and QVHighlights demonstrate that LLaVA-MR outperforms 11 state-of-the-art methods, achieving an improvement of 1.82% in R1@0.5 and 1.29% in mAP@0.5 on the QVHighlights dataset. Our implementation will be open-sourced upon acceptance.
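The R1@0.5 figure quoted above is a recall metric over temporal intersection-over-union (IoU): a query counts as a hit when the top-ranked predicted span overlaps the ground-truth span with IoU at or above 0.5. The following is a minimal sketch of that computation, not the authors' evaluation code. The function names `temporal_iou` and `recall_at_1` are hypothetical, each query is assumed to have a single ground-truth span in seconds, and the `>=` comparison at the threshold is an assumption (benchmark toolkits differ on boundary handling and on queries with several ground-truth windows).

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair; this representation is an assumption.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span],
                iou_threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per ground truth")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Toy data for illustration only, not taken from any benchmark.
    preds = [(2.0, 8.0), (10.0, 12.0)]
    gts = [(3.0, 9.0), (20.0, 25.0)]
    print(recall_at_1(preds, gts))  # first query hits (IoU = 5/7), second misses
```

In the toy example the first prediction overlaps its ground truth for 5 of 7 seconds of union, so it passes the 0.5 threshold, while the second prediction is disjoint from its ground truth; the resulting recall is 0.5.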
