MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

Donghoon Han, Eunhwan Park, Gisang Lee, Adam Lee, Nojun Kwak

TL;DR

MERLIN (Multimodal Embedding Refinement via LLM-based Iterative Navigation) is a novel, training-free pipeline that leverages Large Language Models (LLMs) for iterative feedback learning.

Abstract

The rapid expansion of multimedia content has made accurately retrieving relevant videos from large collections increasingly challenging. Recent advancements in text-video retrieval have focused on cross-modal interactions, large-scale foundation model training, and probabilistic modeling, yet often neglect the crucial user perspective, leading to discrepancies between user queries and the content retrieved. To address this, we introduce MERLIN (Multimodal Embedding Refinement via LLM-based Iterative Navigation), a novel, training-free pipeline that leverages Large Language Models (LLMs) for iterative feedback learning. MERLIN refines query embeddings from a user perspective, enhancing alignment between queries and video content through a dynamic question answering process. Experimental results on datasets like MSR-VTT, MSVD, and ActivityNet demonstrate that MERLIN substantially improves Recall@1, outperforming existing systems and confirming the benefits of integrating LLMs into multimodal retrieval systems for more responsive and context-aware multimedia retrieval.
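
The abstract reports its gains in terms of Recall@1. As a reading aid, here is a minimal sketch of how Recall@K is commonly computed from a query-by-video similarity matrix. The function name `recall_at_k`, the toy similarity values, and the tie-handling rule are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
import numpy as np


def recall_at_k(similarity, target_idx, k=1):
    """Fraction of queries whose target video is ranked within the top k.

    similarity: (num_queries, num_videos) array of query-video scores.
    target_idx: (num_queries,) array with the ground-truth video index per query.
    """
    similarity = np.asarray(similarity, dtype=float)
    target_idx = np.asarray(target_idx)
    # Score that each query assigns to its own ground-truth video.
    target_scores = similarity[np.arange(len(target_idx)), target_idx]
    # 0-based rank = number of videos scoring strictly higher than the target
    # (assumed tie rule: ties do not push the target down).
    ranks = (similarity > target_scores[:, None]).sum(axis=1)
    return float((ranks < k).mean())


# Hypothetical toy scores: 3 queries against 3 videos.
toy_similarity = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
])
toy_targets = np.array([0, 1, 1])
toy_r1 = recall_at_k(toy_similarity, toy_targets, k=1)
```

In this toy case the second query ranks its target last, so only two of the three queries count as hits at K=1.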

Paper Structure (28 sections, 7 equations, 6 figures, 5 tables, 1 algorithm)

This paper contains 28 sections, 7 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: An illustration of the discrepancy between a video caption, which could be treated as a user query, and the corresponding video from the MSR-VTT dataset. Blue indicates details that can be observed statically within a video frame, while red indicates information that can only be obtained temporally across multiple frames.
  • Figure 2: An illustration of MERLIN for text-video retrieval. The yellow arrow represents the LLM Questioner returning a question for the next round based on the metadata of the anchor video (Section \ref{sec:question_generation}). The green arrow represents the human-simulating LLM agent returning an answer based on the "video in mind" through the Question Answering module and the Aggregation module (Section \ref{sec:human_simulating_llm_agent}). The pink arrow represents MERLIN returning retrieved video candidates through the Multimodal Encoder and Reranker (Section \ref{sec:refine}). The system initially retrieves video candidates $\hat{v}^0$ based on the input query text $q$ using a pre-trained multimodal encoder. Using an anchor video from these candidates, the LLM Question Generator produces a question $\hat{q}^1$ to elicit additional information from the user (Section \ref{sec:question_generation}). The LLM Agent answers this question based on the "video in mind", producing an answer $\tilde{a}^1$ that mimics human feedback. The query and answer embeddings are then gradually integrated in each round. The updated query embedding is used to rerank the video candidates, yielding $\hat{v}^1$, and the process repeats for multiple rounds.
  • Figure 3: An illustration of the average ranking of target video for each dataset.
  • Figure 4: Qualitative evaluation of MERLIN on ActivityNet. Sample: v_juiMCvZUYwk.
  • Figure 5: Qualitative evaluation of MERLIN on MSVD. Sample: hbE29pZh76I_3_8.
  • ...and 1 more figures
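
The loop described in the Figure 2 caption (retrieve, ask, answer, integrate embeddings, rerank) can be sketched roughly as follows. This is an illustrative sketch only: the callables `ask`, `answer`, and `embed_text` are hypothetical stand-ins for the LLM Questioner, the human-simulating LLM agent, and the text side of the multimodal encoder. The choice of the top-ranked candidate as the anchor and the weighted-average update with weight `alpha` are assumptions, since the caption states only that the query and answer embeddings are "gradually integrated".

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length so dot products act as cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def refine_and_rerank(query_emb, video_embs, ask, answer, embed_text,
                      rounds=3, alpha=0.5):
    """Iteratively refine a query embedding and rerank video candidates.

    ask(anchor_index) -> question text        (hypothetical LLM Questioner)
    answer(question)  -> answer text          (hypothetical simulated user)
    embed_text(text)  -> embedding vector     (hypothetical text encoder)
    alpha is an assumed mixing weight, not a value from the paper.
    """
    q = normalize(query_emb)
    videos = normalize(video_embs)
    # Round 0: initial retrieval with the unrefined query.
    ranking = np.argsort(-(videos @ q))
    history = [ranking]
    for _ in range(rounds):
        anchor = int(ranking[0])           # assumption: anchor = top-ranked video
        question = ask(anchor)             # question about the anchor video
        reply = answer(question)           # answer based on the "video in mind"
        a = normalize(embed_text(reply))
        # Assumed integration rule: convex combination, then renormalize.
        q = normalize((1.0 - alpha) * q + alpha * a)
        ranking = np.argsort(-(videos @ q))
        history.append(ranking)
    return q, history
```

A caller would pass real encoder and LLM wrappers in place of the three callables; `history[i]` then holds the candidate ordering after round `i`, which makes it easy to track how the target video's rank changes across rounds.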