Multi-Facet Blending for Faceted Query-by-Example Retrieval
Heejin Do, Sangwon Ryu, Jonghwi Kim, Gary Geunbae Lee
TL;DR
This work tackles faceted query-by-example (QBE) retrieval, where user intent is conveyed through facet constraints rather than whole-document similarity. FaBle (multi-facet blending) is a modular, LLM-guided augmentation pipeline that decomposes documents into facet units, generates facet-specific similar and dissimilar fragments, and recomposes them into facet-conditioned training pairs without predefined facet labels. Through triplet-based fine-tuning of document embeddings, FaBle improves facet-aware retrieval, achieving gains on CSFCube and transferring to educational items on the new FEIR benchmark, which indicates domain generalization in data-scarce settings. The approach reduces reliance on large labeled datasets and domain-specific annotations, and FEIR and the accompanying code are released on GitHub for broader adoption and future research.
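To make "triplet-based fine-tuning" concrete, the following is a minimal sketch of a standard triplet margin objective over embedding vectors. The summary does not specify the loss variant, so the Euclidean distance, the margin value, and the function names here are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value; not given in the source
) -> float:
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the facet-relevant (positive) document is closer
    to the anchor than the facet-irrelevant (negative) one by at least
    `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings: the positive coincides with the anchor and
# the negative is far away, so the triplet is already satisfied.
satisfied = triplet_margin_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])

# Swapping positive and negative violates the constraint and yields a
# positive loss that a fine-tuning step would try to reduce.
violated = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

In practice the vectors would come from the document encoder being fine-tuned, with the query document as anchor and the synthesized facet-conditioned documents as positive and negative.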
Abstract
With the growing demand to serve fine-grained user intents, faceted query-by-example (QBE), which retrieves similar documents conditioned on specific facets, has recently gained attention. However, because facet-level relevance datasets are lacking, prior approaches mainly rely on document-level comparisons using basic indicators such as citations. This restricts them to citation-based domains and fails to capture the intricacies of facet constraints. In this paper, we propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing documents to explicitly synthesize facet-specific training sets. We automatically decompose documents into facet units and generate relevant and irrelevant pairs by leveraging LLMs' intrinsic distinguishing capabilities; dynamically recomposing the units then yields document pairs informed by facet-wise relevance. This modularization eliminates the need for predefined facet knowledge or labels. Further, to demonstrate FaBle's efficacy in a new domain beyond citation-based scientific paper retrieval, we release a benchmark dataset for educational exam item QBE. FaBle augmentation on 1K documents substantially aids training in obtaining facet-conditional embeddings.
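The decompose-and-recompose idea can be sketched as follows. This is an illustrative simplification under stated assumptions: the facet names, the `blend` and `render` helpers, and the specific blending rule (the positive matches the query only on the target facet, the negative matches it everywhere except the target facet) are hypothetical choices, and the LLM-generated fragments are represented by plain strings supplied by the caller rather than by real model calls.

```python
from typing import Dict, Tuple

# A document represented as facet name -> text fragment (hypothetical format).
FacetDoc = Dict[str, str]


def blend(
    query_units: FacetDoc,
    similar: FacetDoc,
    dissimilar: FacetDoc,
    target_facet: str,
) -> Tuple[FacetDoc, FacetDoc]:
    """Recompose facet units into a (positive, negative) pair for one facet.

    Assumed blending rule, for illustration only:
      - positive: similar fragment on the target facet, dissimilar elsewhere
      - negative: dissimilar fragment on the target facet, similar elsewhere
    Relevance to the query is then determined by the target facet alone,
    not by overall document similarity.
    """
    if target_facet not in query_units:
        raise KeyError(f"unknown facet: {target_facet}")
    positive: FacetDoc = {}
    negative: FacetDoc = {}
    for facet in query_units:
        if facet == target_facet:
            positive[facet] = similar[facet]
            negative[facet] = dissimilar[facet]
        else:
            positive[facet] = dissimilar[facet]
            negative[facet] = similar[facet]
    return positive, negative


def render(doc: FacetDoc) -> str:
    """Join facet fragments back into a single document string."""
    return " ".join(doc.values())


# Placeholder strings stand in for decomposed units and LLM outputs.
query = {"background": "bg-q", "method": "m-q", "result": "r-q"}
sim = {"background": "bg-sim", "method": "m-sim", "result": "r-sim"}
dis = {"background": "bg-dis", "method": "m-dis", "result": "r-dis"}

pos_doc, neg_doc = blend(query, sim, dis, target_facet="method")
```

Running `blend` once per facet on the same decomposed document would give one training triplet per facet (query, positive, negative), which is how a small set of source documents could be expanded into a facet-specific training set.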
