De-DSI: Decentralised Differentiable Search Index

Petru Neague, Marcel Gregoriadis, Johan Pouwelse

TL;DR

De-DSI addresses privacy-preserving, scalable information retrieval by decentralising the training and inference of a differentiable search index. It introduces a shard-ensemble of DSI models and a softmax-based confidence ensemble to aggregate cross-peer results without accessing document content. The approach demonstrates that decentralized retrieval can achieve performance comparable to centralized methods and even supports retrieval of magnet links for multimedia items. This work provides a practical blueprint for Web-scale, privacy-aware search powered by decentralized generative AI.

Abstract

This study introduces De-DSI, a novel framework that fuses large language models (LLMs) with genuine decentralization for information retrieval, particularly employing the differentiable search index (DSI) concept in a decentralized setting. Focused on efficiently connecting novel user queries with document identifiers without direct document access, De-DSI operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced, where the dataset is partitioned into smaller shards for individual model training. This approach maintains accuracy by reducing the amount of data each model needs to handle, and it scales by aggregating the outcomes of multiple models. The aggregation uses beam search to identify the top docids and applies a softmax function to normalize their scores; the documents with the highest normalized scores are selected for retrieval. The decentralized implementation demonstrates that retrieval success is comparable to centralized methods, with the added benefit that computational complexity can be distributed across the network. This setup also allows multimedia items to be retrieved through magnet links, eliminating the need for platforms or intermediaries.
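
The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes that each shard's model returns a short beam of (docid, raw score) pairs, that the softmax is taken over each shard's beam separately, and that a docid proposed by more than one shard keeps its highest confidence. The function names `softmax` and `confidence_ensemble` and the input layout are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_ensemble(
    shard_beams: List[List[Tuple[str, float]]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Merge per-shard beams into one ranked list of (docid, confidence).

    shard_beams holds one beam per shard; each beam is a list of
    (docid, raw beam score) pairs. This layout is an assumption made
    for illustration only.
    """
    best: Dict[str, float] = {}
    for beam in shard_beams:
        if not beam:
            continue  # a shard that returned nothing contributes nothing
        # Normalize scores within this shard's beam.
        probs = softmax([score for _, score in beam])
        for (docid, _), prob in zip(beam, probs):
            # Assumption: on duplicate docids, keep the highest confidence.
            best[docid] = max(best.get(docid, 0.0), prob)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    beams = [
        [("doc_a", 2.0), ("doc_b", 0.0)],  # shard 1 is confident in doc_a
        [("doc_c", 0.1), ("doc_d", 0.0)],  # shard 2 is nearly undecided
    ]
    for docid, confidence in confidence_ensemble(beams, top_k=3):
        print(f"{docid}\t{confidence:.3f}")
```

Because the softmax is applied per shard in this sketch, a shard whose beam is dominated by one candidate contributes a higher confidence than a shard whose candidates score almost equally, which is the property a confidence-based ensemble relies on when ranking across shards.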

Paper Structure (12 sections, 1 equation, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Confidence-Ensemble using 10 shards (i.e. 10 peer groups). The scores under each shard are post-softmax. The result of the ensemble is the top-5 documents with the largest post-softmax scores.
  • Figure 2: Success rate of matching unseen queries to the correct document, as a function of the number of seen queries used in training.
  • Figure 3: Results of top-$k$ accuracies when inferring from the ensemble (Ens) vs. from only the personal model (Pers), with $k=1..5$.
  • Figure 4: Rolling mean (with window = 500) of loss per batch for 10 peers within the same shard. Each color represents the loss of one of the peers.
  • Figure 5: Accuracies on the test set, by shard and beam. Blue dots represent the top-1 accuracy, while red dots show the top-5 accuracy of one peer in the associated shard.
  • ...and 1 more figures