Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval
Yu Wang, Tianhao Tan, Yifei Wang
TL;DR
This work tackles multilingual retrieval of long, terminology-dense medical instructional videos with a multi-stage pipeline that splits subtitles into semantically coherent chunks, enriches each chunk with knowledge-graph (KG) facts, and stores the chunks in a hierarchical index built on language-agnostic embeddings. Retrieval runs a coarse-to-fine tree search followed by multilingual LLM re-ranking, which preserves precision while avoiding exhaustive cross-encoder scoring. Ablation studies show that KG enrichment, hierarchical indexing, and LLM re-ranking each contribute, with LLM re-ranking giving the largest gain; the full system achieves state-of-the-art results on the mVCR benchmark. The approach offers a scalable solution for accurate multilingual video retrieval in specialized medical corpora. Future directions include structured KG reasoning, knowledge distillation for compact re-rankers, and multi-modal features.
Abstract
Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
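
To make the coarse-to-fine tree search concrete, the following is a minimal sketch and not the authors' implementation. It assumes a beam-style descent: at each level only the most similar internal nodes are expanded, and the leaf chunks reached this way are ranked by cosine similarity. The names `Node`, `cosine`, `coarse_to_fine_search`, `beam_width`, and `top_k` are hypothetical, and the 2-D vectors are toy stand-ins for the multilingual encoder's embeddings. The LLM re-ranking step is not shown; it would be applied to the returned chunks.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    """One node of a hypothetical index tree: a summary node or a leaf chunk."""
    embedding: list[float]
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_to_fine_search(root, query, beam_width=2, top_k=3):
    """Descend level by level, expanding only the `beam_width` best internal
    nodes per level, then return the `top_k` most similar leaf chunks."""
    frontier = [root]
    leaves = []
    while frontier:
        candidates = [child for node in frontier for child in node.children]
        # Leaves are collected as soon as they are reached.
        leaves.extend(c for c in candidates if not c.children)
        # Internal nodes compete for a place in the next frontier (pruning step).
        internal = [c for c in candidates if c.children]
        internal.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
        frontier = internal[:beam_width]
    leaves.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
    return leaves[:top_k]


# Toy index: two topic branches with leaf chunks (all values are made up).
cardio = Node([1.0, 0.0], "cardiology", [
    Node([0.9, 0.1], "how to take a pulse"),
    Node([0.8, 0.3], "CPR hand placement"),
])
derm = Node([0.0, 1.0], "dermatology", [
    Node([0.1, 0.9], "cleaning a wound"),
])
root = Node([0.5, 0.5], "root", [cardio, derm])

hits = coarse_to_fine_search(root, [1.0, 0.0], beam_width=1, top_k=2)
print([h.text for h in hits])  # ['how to take a pulse', 'CPR hand placement']
```

With `beam_width=1` the dermatology branch is pruned after a single comparison, so its chunk is never scored. This illustrates how a tree search can limit the number of chunks passed to a more expensive re-ranker.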
