F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

Hanbo Bi; Zhiqiang Yuan; Zexi Jia; Jiapei Zhang; Chongyang Li; Peixiang Luo; Ying Deng; Xiaoyue Duan; Jinchao Zhang

TL;DR

This work defines Fine-grained Fragment Retrieval (FFR): locating semantically coherent utterance-image fragments within long-form multimodal dialogues, a need that traditional retrieval over recent dialogue history does not meet. It introduces MLDR, a long-turn multimodal dialogue dataset, and a real-world WeChat test set to probe cross-domain generalization. The proposed F2RVLM framework uses a two-stage training regime (supervised fine-tuning followed by GRPO-based reinforcement learning) with multi-objective rewards and a difficulty-aware curriculum to optimize fragment semantic coherence and retrieval precision. In both in-domain and real-world evaluations, F2RVLM outperforms popular VLMs on fragment retrieval, demonstrating robust long-context reasoning and cross-domain generalization at relatively efficient model sizes.

Abstract

Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, such methods often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each dialogue naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
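
The difficulty-aware curriculum sampling described above can be illustrated with a minimal sketch. It assumes each training instance already has a scalar difficulty score (the abstract says this is model-predicted but does not say how it is computed), and it uses a simple staged schedule in which each stage exposes a larger, harder prefix of the ranked data. The function name `curriculum_stages`, the `num_stages` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def curriculum_stages(difficulty: Sequence[float], num_stages: int) -> List[List[int]]:
    """Return, for each stage, the indices of the instances available in that stage.

    Instances are ranked from easiest to hardest. Stage k exposes roughly the
    easiest k/num_stages share of the data, so later stages add harder samples.
    This staged schedule is an assumption, not the paper's exact procedure.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    n = len(difficulty)
    # Rank instance indices by ascending difficulty (sorted() is stable for ties).
    ranked = sorted(range(n), key=lambda i: difficulty[i])
    stages = []
    for k in range(1, num_stages + 1):
        # Integer ceiling of n * k / num_stages; the last stage covers all instances.
        cutoff = (n * k + num_stages - 1) // num_stages
        stages.append(ranked[:cutoff])
    return stages


# Hypothetical difficulty scores for four training instances.
scores = [0.9, 0.1, 0.5, 0.3]
stages = curriculum_stages(scores, num_stages=2)
for stage_id, indices in enumerate(stages, start=1):
    print(f"stage {stage_id}: instance indices {indices}")
```

In a training loop, batches for a given stage would be drawn only from that stage's index list, so the hardest instances enter sampling last.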
