Hierarchical Semantic Retrieval with Cobweb

Anant Gupta, Karthik Singaravadivelan, Zekun Wang

TL;DR

This work addresses the limitation of flat neural retrieval by introducing Cobweb, a hierarchy-aware retrieval framework that organizes sentence embeddings into a prototype tree to enable coarse-to-fine, interpretable document ranking. By whitening embeddings to satisfy a diagonal covariance assumption and learning Gaussian prototypes online, Cobweb/4V supports two inference strategies, Generalized Best-First Search and Path Sum Prediction, and achieves competitive retrieval performance with strong encoder embeddings (e.g., RoBERTa, T5) while remaining robust when dot-product retrieval degrades (e.g., GPT-2). Across MS MARCO and QQP, Cobweb matches or surpasses inner-product baselines, scales effectively with corpus size, and provides interpretable multi-level relevance signals via prototypes. The results suggest practical impact for scalable, explainable retrieval that leverages corpus structure, with future directions including differentiable Cobweb integration and isotropic embedding design. The node score $s(c)=p(x|c)\,p(c|x)$ and the leaf score $\text{score}(\text{leaf})=\sum_{c\in\text{path}(\text{leaf})}\log s(c)$ encode the multi-level aggregation that underpins the hierarchical retrieval. The category utility $CU(c)=P(c)[U(c_p)-U(c)]$ governs prototype formation during training, reinforcing discriminative, interpretable clusters.
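To make the path-sum idea concrete, here is a minimal sketch, not the authors' implementation, that scores each leaf by summing $\log s(c)$ with $s(c)=p(x|c)\,p(c|x)$ over its root-to-leaf path. The `Node` class, the toy tree, and the choice to normalise $p(c|x)$ over sibling nodes with count-based priors are assumptions made for illustration; the summary above does not specify them.

```python
import numpy as np

class Node:
    """Hypothetical prototype node: diagonal Gaussian plus an instance count."""
    def __init__(self, mean, var, count, children=None, doc_id=None):
        self.mean = np.asarray(mean, dtype=float)
        self.var = np.asarray(var, dtype=float)
        self.count = count
        self.children = children or []
        self.doc_id = doc_id  # set only on leaves

def log_likelihood(x, node):
    """log p(x | c) under a diagonal-covariance Gaussian."""
    diff = x - node.mean
    return float(-0.5 * np.sum(np.log(2 * np.pi * node.var) + diff ** 2 / node.var))

def path_sum_scores(x, root):
    """Return {doc_id: sum of log s(c) along the root-to-leaf path}.

    ASSUMPTION: p(c | x) is normalised over siblings with priors proportional
    to node counts, and the root has posterior 1 (log-posterior 0).
    """
    scores = {}
    stack = [(root, log_likelihood(x, root))]
    while stack:
        node, acc = stack.pop()
        if not node.children:
            scores[node.doc_id] = acc
            continue
        ll = np.array([log_likelihood(x, ch) for ch in node.children])
        log_prior = np.log([ch.count for ch in node.children]) - np.log(node.count)
        joint = ll + log_prior
        log_post = joint - np.logaddexp.reduce(joint)  # log p(c | x) among siblings
        for ch, l, p in zip(node.children, ll, log_post):
            stack.append((ch, acc + l + p))  # add log p(x|c) + log p(c|x)
    return scores

def leaf(vec, doc_id):
    return Node(vec, [1.0, 1.0], 1, doc_id=doc_id)

# Toy two-level tree over 2-D "embeddings" (made-up numbers).
left = Node([0.0, 0.5], [1.0, 1.0], 2, [leaf([0.0, 0.0], "a"), leaf([0.0, 1.0], "b")])
right = Node([5.5, 5.5], [1.0, 1.0], 2, [leaf([5.0, 5.0], "c"), leaf([6.0, 6.0], "d")])
root = Node([2.75, 3.0], [8.0, 8.0], 4, [left, right])

scores = path_sum_scores(np.array([0.1, 0.9]), root)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because every ancestor contributes to a leaf's score, a document under a poorly matching internal prototype is penalised even if its own leaf is moderately close to the query, which is one way to read the "multi-level relevance signal" described above.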

Abstract

Neural document retrieval often treats a corpus as a flat cloud of vectors scored at a single granularity, leaving corpus structure underused and explanations opaque. We use Cobweb, a hierarchy-aware framework, to organize sentence embeddings into a prototype tree and rank documents via coarse-to-fine traversal. Internal nodes act as concept prototypes, providing multi-granular relevance signals and a transparent rationale through retrieval paths. We instantiate two inference approaches: a generalized best-first search and a lightweight path-sum ranker. We evaluate our approaches on MS MARCO and QQP with encoder (e.g., BERT/T5) and decoder (GPT-2) representations. Our results show that our retrieval approaches match dot-product search on strong encoder embeddings while remaining robust when kNN degrades: with GPT-2 vectors, dot-product performance collapses, whereas our approaches still retrieve relevant results. Overall, our experiments suggest that Cobweb provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval via hierarchical prototypes.
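The coarse-to-fine traversal can be illustrated with a plain best-first search over a prototype tree. This is a sketch under stated assumptions rather than the paper's generalized variant: the `Node` class, the toy tree, and the use of squared distance to the prototype mean as the priority are all hypothetical choices, since the abstract does not define the scoring used during search.

```python
import heapq
import itertools
import numpy as np

class Node:
    """Hypothetical prototype node; leaves carry a doc_id."""
    def __init__(self, mean, children=(), doc_id=None):
        self.mean = np.asarray(mean, dtype=float)
        self.children = list(children)
        self.doc_id = doc_id

def best_first_retrieve(query, root, k, max_expansions=100):
    """Expand the most query-like node first; return the first k leaves reached."""
    def priority(node):
        # ASSUMPTION: smaller squared distance to the prototype mean = more relevant.
        return float(np.sum((query - node.mean) ** 2))

    tie = itertools.count()  # tie-breaker so heapq never compares Node objects
    frontier = [(priority(root), next(tie), root)]
    results, visited = [], []
    expansions = 0
    while frontier and len(results) < k and expansions < max_expansions:
        _, _, node = heapq.heappop(frontier)
        visited.append(node)  # the visit order doubles as a retrieval rationale
        if not node.children:
            results.append(node.doc_id)
            continue
        expansions += 1
        for child in node.children:
            heapq.heappush(frontier, (priority(child), next(tie), child))
    return results, visited

# Toy two-level tree over 2-D "embeddings" (made-up numbers).
left = Node([0.0, 0.5], [Node([0.0, 0.0], doc_id="a"), Node([0.0, 1.0], doc_id="b")])
right = Node([5.5, 5.5], [Node([5.0, 5.0], doc_id="c"), Node([6.0, 6.0], doc_id="d")])
root = Node([2.75, 3.0], [left, right])

results, visited = best_first_retrieve(np.array([0.1, 0.9]), root, k=2)
print(results)
```

In this toy run the `right` subtree is never expanded, which shows how internal prototypes let the search skip regions of the corpus that look irrelevant at a coarse level.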

Paper Structure

This paper contains 40 sections, 7 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Examples of learned sub-hierarchies of whitened GPT-2 embeddings from the MS MARCO dataset, showing that the Cobweb-BFS metric retrieves relevant documents for the query "synonyms of inhabitable" while the dot product fails to do so. The correct document is highlighted in red.
  • Figure 2: Examples of sub-hierarchies learned by Cobweb/4V on QQP and MS MARCO using RoBERTa embeddings, with subtopics color-coded by theme. Yellow: education in India; red: earning money; green: biochemistry; purple: nutrition.
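Figure 1 relies on whitened embeddings, and the TL;DR notes that whitening is what lets the diagonal-covariance assumption hold. As a rough illustration, the following sketch applies standard PCA whitening with NumPy. The function name `fit_whitening`, the `eps` constant, and the synthetic data are assumptions for the example; the summary does not say which whitening variant the authors use.

```python
import numpy as np

def fit_whitening(X, eps=1e-8):
    """Return (mean, W) such that (X - mean) @ W has approximately identity covariance."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # covariance is symmetric, so eigh applies
    W = eigvecs / np.sqrt(eigvals + eps)    # rotate onto eigenvectors, then rescale each axis
    return mean, W

# Synthetic, correlated, off-centre vectors standing in for sentence embeddings.
rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
X = rng.normal(size=(2000, 4)) @ A + 3.0

mean, W = fit_whitening(X)
Z = (X - mean) @ W
cov_Z = np.cov(Z, rowvar=False)
print(np.round(cov_Z, 3))
```

After this transform the dimensions are decorrelated with unit variance, so modelling each prototype with a per-dimension (diagonal) Gaussian discards far less information than it would on the raw vectors.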