Space/time-efficient RDF stores based on circular suffix sorting

Nieves R. Brisaboa, Ana Cerdeira-Pena, Guillermo de Bernardo, Antonio Fariña, Gonzalo Navarro

TL;DR

RDFCSA is a compressed representation of RDF datasets that also supports efficient querying. It answers join queries using either merge-join or chaining-join strategies over the triple patterns, coupled with specific optimizations such as variable filling.

Abstract

In recent years, RDF has gained popularity as a format for the standardized publication and exchange of information in the Web of Data. In this paper we introduce RDFCSA, a data structure that self-indexes an RDF dataset in small space and supports efficient querying. RDFCSA regards the triples of the RDF store as short circular strings and applies suffix sorting on those strings, so that triple-pattern queries reduce to prefix searching on the string set. The RDF store is then represented compactly using a Compressed Suffix Array (CSA), a proven technology in text indexing that efficiently supports prefix searches. Our experiments show that RDFCSA provides a compact RDF representation, using less than 60% of the space required by the raw data, and yields fast and consistent query times when answering triple-pattern queries (a few microseconds per result). We also support join queries, a key component of most SPARQL queries. RDFCSA is shown to provide an excellent space/time tradeoff, typically using much less space than alternatives that compete in time.
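The reduction from triple patterns to prefix searches can be illustrated with a minimal, uncompressed sketch. It is not the RDFCSA structure itself: it stores every rotation of every triple in a plain sorted list instead of a Compressed Suffix Array, and the function names `build_index` and `query`, the role tags, and the sample IDs are assumptions made for illustration only.

```python
from bisect import bisect_left


def build_index(triples):
    """Sort all three rotations of every (s, p, o) triple of integer IDs.

    Each symbol is tagged with its role (0 = subject, 1 = predicate,
    2 = object) so that rotations of different triples never collide.
    This tagging is an assumption of the sketch, not the paper's encoding.
    """
    rotations = []
    for s, p, o in triples:
        tagged = ((0, s), (1, p), (2, o))
        for k in range(3):
            rotations.append(tagged[k:] + tagged[:k])
    rotations.sort()
    return rotations


def query(index, s=None, p=None, o=None):
    """Answer a triple pattern; None marks an unbound (variable) position."""
    pattern = ((0, s), (1, p), (2, o))
    n_bound = sum(v is not None for _, v in pattern)
    if n_bound == 0:
        # No bound component: scan only the rotations that start at a subject.
        lo = bisect_left(index, ((0,),))
        hi = bisect_left(index, ((1,),))
    else:
        # Rotate the pattern so that its bound components form a prefix.
        for k in range(3):
            rot = pattern[k:] + pattern[:k]
            if all(v is not None for _, v in rot[:n_bound]):
                prefix = rot[:n_bound]
                break
        # Binary-search the contiguous range of rotations with that prefix.
        # (3,) sorts after every tagged symbol because roles are 0..2.
        lo = bisect_left(index, prefix)
        hi = bisect_left(index, prefix + ((3,),))
    # Undo the rotation: sorting by role tag restores (s, p, o) order.
    return [tuple(v for _, v in sorted(r)) for r in index[lo:hi]]


triples = [(1, 1, 2), (1, 2, 3), (2, 1, 3), (3, 2, 1)]
index = build_index(triples)
matches = query(index, s=1, o=3)  # pattern (1, ?p, 3), searched as prefix (o, s)
```

Because the strings are circular, any combination of bound components can be rotated to the front, so a single sorted order serves all eight triple-pattern shapes. The sketch triples the stored data to make this visible; the space savings described in the abstract come from the CSA representation, which this example does not attempt.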

Paper Structure

This paper contains 29 sections, 13 figures.

Figures (13)

  • Figure 1: Example of RDF graph and its representation as a set of triples.
  • Figure 2: Dictionary encoding used in HDT for the set of triples in Figure 1.
  • Figure 3: Structures involved in the creation of an RDFCSA for the triples in Figures 1 and 2.
  • Figure 4: D-select+forward-check strategy for pattern $(s,p,o)=(8,4,261)$.
  • Figure 5: D-select+backward-check strategy for pattern $(s,p,o)=(8,4,261)$.
  • ...and 8 more figures