UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers

Dehai Min, Zhiyang Xu, Guilin Qi, Lifu Huang, Chenyu You

TL;DR

UniHGKR introduces a unified, instruction-aware retriever for heterogeneous knowledge sources and presents CompMix-IR, the first native benchmark for such retrieval. The framework employs a three-stage training pipeline—self-supervised unified embedding, text-anchored alignment, and instruction-guided fine-tuning with specialized contrastive losses—to build a shared embedding space that respects user instructions. Empirical results show substantial gains over state-of-the-art baselines on CompMix-IR, with further improvements when extending to LLM-based retrievers and open-domain QA tasks like ConvMix. The approach demonstrates strong zero-shot generalization and practical impact for heterogeneous QA systems, enabling more faithful retrieval across diverse data modalities. Future work includes expanding domain coverage, incorporating additional modalities, and releasing resources to foster broader evaluation.
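To make the "contrastive losses" mentioned above more concrete, here is a minimal sketch of a generic InfoNCE-style loss for one query. It is not the paper's specialized loss, whose exact form is not given in this summary. The function names, the dot-product similarity, and the temperature value are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """Generic InfoNCE loss: -log softmax score of the positive candidate.

    Illustrative only; the paper's instruction-aware losses may differ.
    """
    # The positive candidate sits at index 0, followed by the negatives.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


# Toy 2-d embeddings: the loss is small when the query matches the positive.
aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss pulls a query toward its relevant evidence and away from the other candidates, which is the general mechanism behind a shared embedding space.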

Abstract

Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. This framework is highly scalable, with a BERT-based version and a UniHGKR-7B version trained on large language models. Also, we introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries, covering four different types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in two scenarios, respectively. Finally, by equipping our retriever for open-domain heterogeneous QA systems, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 5.90 points.

Paper Structure

This paper contains 33 sections, 7 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: Compared to traditional methods, UniHGKR follows user instructions to process queries and retrieves from a heterogeneous pool of knowledge candidates.
  • Figure 2: Illustration of our UniHGKR training framework.
  • Figure 3: Illustration of Data-Text Pair Collection. The bold red "is" and the comma "," are used in the concatenation template when linearizing structured data. The prompts used for GPT-4o-mini can be found in the Appendix.
  • Figure 4: The performance of UniHGKR-base in retrieval Scenario 1 with longer evidence. Here, 10X indicates that the average length of the evidence in the corpus is 10 times the original (1X), and so on.
  • Figure 5: The performance of UniHGKR-base in retrieval Scenario 2 with longer evidence.
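
The Figure 3 caption says that "is" and "," serve as the concatenation template when linearizing structured data. A minimal sketch of one plausible reading follows: each attribute-value pair is rendered as "attribute is value", and the pairs are joined with commas. The function name `linearize_record` and the sample record are hypothetical, and the paper's exact template may differ.

```python
from typing import Mapping


def linearize_record(record: Mapping[str, str]) -> str:
    """Flatten attribute-value pairs into one text string.

    Assumed template: "<attribute> is <value>", joined by ", ".
    """
    return ", ".join(f"{key} is {value}" for key, value in record.items())


# Hypothetical table row, used only to illustrate the template.
row = {"Title": "Inception", "Director": "Christopher Nolan"}
linearized = linearize_record(row)
print(linearized)  # Title is Inception, Director is Christopher Nolan
```

Linearizing in this way lets a text encoder consume table rows, infobox entries, and similar structured items in the same form as ordinary passages.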