ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval
Shubham Gupta, Zichao Li, Tianyi Chen, Cem Subakan, Siva Reddy, Perouz Taslakian, Valentina Zantedeschi
TL;DR
ReTreever introduces a differentiable, tree-based retrieval framework that organizes document snippets into a binary tree, providing coarse-to-fine representations while generally preserving the accuracy of full embeddings. It learns a routing function at each internal node with a contrastive objective based on negative Total Variation Distance, yielding both efficient coarse representations and accurate leaf-level embeddings without relying on costly LLMs during construction or search. The tree also makes the corpus organization interpretable: semantic groupings and retrieval behavior can be inspected at each level. Empirical results on NQ, HotpotQA, TopiOCQA, and RepLiQA show competitive or superior retrieval performance at lower latency than flat and other hierarchical baselines, which supports its practicality for scalable and transparent retrieval systems.
Abstract
Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various levels of granularity, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that queries and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, showing that this family of techniques can be viable in practical applications.
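To make the routing idea concrete, the following is a minimal sketch of how an embedding could be softly routed through a binary tree and how two such routings could be compared with negative Total Variation Distance. It is an illustration under stated assumptions, not the authors' implementation: the linear-plus-sigmoid router, the breadth-first node layout, and the names `route` and `neg_tvd` are hypothetical, the weights are random rather than learned, and the contrastive training itself is omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def route(x, weights, biases, depth):
    """Softly route embedding x through a complete binary tree.

    Assumption: one linear router per internal node, stored in
    breadth-first order (root at index 0, then level 1, and so on).
    Returns one distribution per level: levels[k] has 2**(k+1) entries
    summing to 1, and levels[-1] is the distribution over leaves.
    """
    levels = []
    probs = [1.0]  # all probability mass starts at the root
    for level in range(depth):
        start = 2 ** level - 1  # index of the first node on this level
        nxt = []
        for i, mass in enumerate(probs):
            w, b = weights[start + i], biases[start + i]
            go_left = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            # each node splits its mass between its left and right child
            nxt.extend([mass * go_left, mass * (1.0 - go_left)])
        probs = nxt
        levels.append(probs)
    return levels


def neg_tvd(p, q):
    """Negative Total Variation Distance: 0 for identical distributions, -1 at worst."""
    return -0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy usage with random (untrained) routers and random embeddings.
rng = random.Random(0)
dim, depth = 4, 3
n_nodes = 2 ** depth - 1
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
biases = [0.0] * n_nodes
query = [rng.gauss(0, 1) for _ in range(dim)]
doc = [rng.gauss(0, 1) for _ in range(dim)]

q_levels = route(query, weights, biases, depth)
d_levels = route(doc, weights, biases, depth)
# One similarity score per tree level, from coarsest (2 nodes) to finest (8 leaves).
scores = [neg_tvd(q, d) for q, d in zip(q_levels, d_levels)]
```

In this sketch the distribution at a shallow level is a coarse representation (few entries), while the leaf-level distribution is the finest one. Summing the mass of sibling nodes recovers the parent's mass, which is what makes the representations nested from coarse to fine.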
