UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers
Dehai Min, Zhiyang Xu, Guilin Qi, Lifu Huang, Chenyu You
TL;DR
UniHGKR introduces a unified, instruction-aware retriever for heterogeneous knowledge sources and presents CompMix-IR, the first native benchmark for such retrieval. The framework uses a three-stage training pipeline (self-supervised unified embedding, text-anchored alignment, and instruction-guided fine-tuning with specialized contrastive losses) to build a shared embedding space that respects user instructions. Empirical results show substantial gains over state-of-the-art baselines on CompMix-IR, with further improvements when the approach is extended to LLM-based retrievers and applied to open-domain QA tasks such as ConvMix. The approach also shows strong zero-shot generalization, which makes it practical for heterogeneous QA systems that must retrieve faithfully across different data types. Future work includes expanding domain coverage, incorporating additional modalities, and releasing resources to foster broader evaluation.
Abstract
Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. This framework is highly scalable, with a BERT-based version and a UniHGKR-7B version trained on large language models. We also introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries covering four different types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving relative improvements of up to 6.36% and 54.23% in the two scenarios, respectively. Finally, by equipping open-domain heterogeneous QA systems with our retriever, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 5.90 points.
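To make the idea of instruction-aware contrastive fine-tuning more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the loss functions, so the standard InfoNCE contrastive loss is used here as a stand-in. The `[SEP]` input format, the bag-of-words `embed` function (a toy replacement for a real encoder), the temperature value, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def build_input(instruction: str, question: str) -> str:
    """Prepend the retrieval instruction to the question (hypothetical format)."""
    return f"{instruction} [SEP] {question}"


def embed(text: str) -> dict:
    """Toy stand-in for an encoder: an L2-normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    # Guard against empty input, where the norm would be zero.
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: dict, b: dict) -> float:
    """Dot product of two normalized sparse vectors (their cosine similarity)."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def info_nce(query: dict, positive: dict, negatives: list,
             temperature: float = 0.05) -> float:
    """InfoNCE: negative log-probability of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Usage: the instruction is part of the query, so it shapes the embedding
# that is compared against candidates of different types.
query = embed(build_input("retrieve a table", "who directed inception"))
table_entry = embed("table inception director christopher nolan")
text_entry = embed("inception is a film directed by christopher nolan")
loss = info_nce(query, positive=table_entry, negatives=[text_entry])
print(f"loss with the table entry as positive: {loss:.4f}")
```

In this sketch, a candidate of a type other than the one the instruction asks for is simply treated as a negative. This is one plausible way to make a retriever follow type instructions, assumed here for illustration; a real system would replace `embed` with a trained encoder and minimize the loss over batches of training examples.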
