Multi-Facet Blending for Faceted Query-by-Example Retrieval

Heejin Do, Sangwon Ryu, Jonghwi Kim, Gary Geunbae Lee

TL;DR

This work tackles faceted query-by-example (QBE), where user intent is conveyed through facet constraints rather than whole-document similarity. FaBle introduces a modular, LLM-guided augmentation pipeline that decomposes documents into facet units, generates facet-specific similar and dissimilar fragments, and recomposes them into facet-conditioned training pairs without predefined facet labels. Through triplet-based fine-tuning of document embeddings, FaBle improves facet-aware retrieval, achieving notable gains on CSFCube and, importantly, transferring to educational items via the new FEIR benchmark, demonstrating domain generalization in data-scarce settings. The approach reduces reliance on large labeled datasets or domain-specific annotations and is complemented by releasing FEIR and accompanying code on GitHub for broader adoption and future research.

Abstract

With growing demand for retrieval that fits fine-grained user intents, faceted query-by-example (QBE), which retrieves similar documents conditioned on specific facets, has recently gained attention. However, because facet-level relevance datasets are lacking, prior approaches mainly depend on document-level comparisons using basic indicators such as citations; this limits their use to citation-based domains and fails to capture the intricacies of facet constraints. In this paper, we propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing documents to explicitly synthesize facet-specific training sets. We automatically decompose documents into facet units and generate (ir)relevant pairs by leveraging LLMs' intrinsic distinguishing capabilities; dynamically recomposing the units then yields document pairs informed by facet-wise relevance. Our modularization eliminates the need for pre-defined facet knowledge or labels. Further, to demonstrate FaBle's efficacy in a new domain beyond citation-based scientific paper retrieval, we release a benchmark dataset for educational exam item QBE. FaBle augmentation on 1K documents substantially assists training in obtaining facet-conditional embeddings.
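
The decompose-and-recompose idea can be illustrated with a small sketch. This is not the authors' implementation: the facet names, the `recompose` and `build_triplet` helpers, and the blending rule are all assumptions made for illustration. In particular, the positive keeps only the target facet similar to the query, and the negative differs only on the target facet. The LLM generation step is replaced by hand-written strings.

```python
from typing import Dict, Tuple

# Assumed facet names for a scientific abstract (hypothetical).
FACETS = ("background", "method", "result")


def recompose(units: Dict[str, str]) -> str:
    """Join facet units back into one document in a fixed facet order."""
    return " ".join(units[facet] for facet in FACETS)


def build_triplet(
    anchor: Dict[str, str],
    similar: Dict[str, str],
    dissimilar: Dict[str, str],
    target_facet: str,
) -> Tuple[str, str, str]:
    """Blend facet units into an (anchor, positive, negative) triplet.

    Assumed blending rule, for illustration only:
    - positive: similar on the target facet, dissimilar elsewhere
    - negative: dissimilar on the target facet, similar elsewhere
    """
    if target_facet not in FACETS:
        raise ValueError(f"unknown facet: {target_facet}")

    positive = dict(dissimilar)
    positive[target_facet] = similar[target_facet]

    negative = dict(similar)
    negative[target_facet] = dissimilar[target_facet]

    return recompose(anchor), recompose(positive), recompose(negative)


if __name__ == "__main__":
    # Stand-ins for the facet units an LLM would produce.
    anchor = {"background": "A-bg.", "method": "A-me.", "result": "A-re."}
    similar = {"background": "S-bg.", "method": "S-me.", "result": "S-re."}
    dissimilar = {"background": "D-bg.", "method": "D-me.", "result": "D-re."}

    for name, doc in zip(
        ("anchor", "positive", "negative"),
        build_triplet(anchor, similar, dissimilar, "method"),
    ):
        print(f"{name}: {doc}")
```

Triplets of this shape could then feed a standard triplet objective, so that the embedding conditioned on the target facet pulls the positive toward the anchor and pushes the negative away, even though the negative overlaps more with the anchor on the other facets.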

Paper Structure

This paper contains 30 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Examples of documents with multiple facets.
  • Figure 2: The overview of the FaBle method and examples of detailed prompts used for scientific paper retrieval.
  • Figure 3: Examples of the generated similar and dissimilar method facets with our self-fed decomposition (left) and without (w/o) the decomposition (right). Directly generating similar and dissimilar facets without decomposition can lead to results containing facets other than the intended one, as highlighted.
  • Figure 4: Hard negative generation procedure (§3.5).
  • Figure 5: Score label distributions per query by facet.
  • ...and 2 more figures