WindTunnel -- A Framework for Community Aware Sampling of Large Corpora

Michael Iannelli

TL;DR

WindTunnel is a novel framework developed at Yext that generates representative samples of large corpora, enabling efficient end-to-end information retrieval experiments. It overcomes limitations of current sampling methods and provides more accurate evaluations.

Abstract

Conducting comprehensive information retrieval experiments, such as in search or retrieval augmented generation, often comes with high computational costs. This is because evaluating a retrieval algorithm requires indexing the entire corpus, which is significantly larger than the set of (query, result) pairs under evaluation. This issue is especially pronounced in big data and neural retrieval, where indexing becomes increasingly time-consuming and complex. In this paper, we present WindTunnel, a novel framework developed at Yext to generate representative samples of large corpora, enabling efficient end-to-end information retrieval experiments. By preserving the community structure of the dataset, WindTunnel overcomes limitations in current sampling methods, providing more accurate evaluations.

Paper Structure

This paper contains 13 sections, 5 figures, 2 tables, and 2 algorithms.

Figures (5)

  • Figure 1: A sample from the MSMarco corpus that preserves community structure, revealing the underlying community organization of documents. Each node in the graph represents a document, and an edge between nodes indicates that the corresponding documents share a common query.
  • Figure 2: A detailed view of five nodes from a single community in the network shown in Figure 1, including document titles. Note the thematic consistency among document titles within the community; each document pertains to the subject of betting in some way.
  • Figure 3: High-Level Architecture of WindTunnel
  • Figure 4: Left: Histogram of the node degree distribution for MSMarco passages, where each node represents a passage and two nodes are neighbors if their corresponding passages respond to the same query. The degree of a node is the number of its neighbors. Right: A comparison of the MSMarco passages' node degree distribution (in blue) with a theoretical power-law distribution (in orange).
  • Figure 5: Architecture of the Semantic Search Pipeline used for our experiments. The pipeline consists of two high-level components operating asynchronously. An offline indexing component accepts the input corpus, vectorizes the elements of that corpus using an embedding model, and indexes them into a vector database. The online ranking and retrieval component accepts an input query from the end-user, vectorizes the query using the same embedding model from the offline indexing component, and then searches for approximate neighbors of that query vector in the vector database.
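
The graph described in the captions of Figures 1 and 4 (documents as nodes, an edge wherever two documents share a query, degree as the neighbor count) can be illustrated with a minimal sketch. This is not the WindTunnel implementation: the function names `build_document_graph` and `node_degrees`, the `(query_id, doc_id)` input format, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_document_graph(qrels):
    """Build an undirected graph from (query_id, doc_id) pairs.

    Each document is a node; two documents are linked when at least one
    query lists both of them as results. Returns {doc_id: set(neighbors)}.
    """
    # Group the documents returned for each query.
    docs_by_query = defaultdict(set)
    for query_id, doc_id in qrels:
        docs_by_query[query_id].add(doc_id)

    neighbors = {}
    for docs in docs_by_query.values():
        # Register every document so isolated ones still appear as nodes.
        for doc in docs:
            neighbors.setdefault(doc, set())
        # Link every pair of documents that answer the same query.
        for a, b in combinations(sorted(docs), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def node_degrees(graph):
    """Degree of a node = number of distinct neighbors."""
    return {doc: len(adjacent) for doc, adjacent in graph.items()}


# Hypothetical toy relevance judgments, not taken from MSMarco.
qrels = [
    ("q1", "d1"), ("q1", "d2"),
    ("q2", "d2"), ("q2", "d3"),
    ("q3", "d4"),
]
graph = build_document_graph(qrels)
degrees = node_degrees(graph)
```

In this toy input, `d2` answers both `q1` and `q2`, so it links `d1` and `d3` into one connected group, while `d4` shares no query with any other document and stays isolated. A histogram of `degrees.values()` is the kind of distribution the left panel of Figure 4 depicts.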