Indaleko: The Unified Personal Index

William Anthony Mason

TL;DR

This dissertation presents the Unified Personal Index (UPI), a memory-aligned architecture that moves personal information retrieval from keyword matching to memory-aligned finding. It provides immediate benefits for existing data and lays foundations for future context-aware systems.

Abstract

Personal information retrieval fails when systems ignore how human memory works. While existing platforms force keyword searches across isolated silos, humans naturally recall through episodic cues such as when, where, and in what context information was encountered. This dissertation presents the Unified Personal Index (UPI), a memory-aligned architecture that bridges this fundamental gap. The Indaleko prototype demonstrates the UPI's feasibility on a dataset of 31 million files spanning 160 TB across eight storage platforms. By integrating temporal, spatial, and activity metadata into a unified graph database, Indaleko enables natural language queries like "photos near the conference venue last spring" that existing systems cannot process. The implementation achieves sub-second query responses through memory anchor indexing, eliminates cross-platform search fragmentation, and maintains perfect precision for well-specified memory patterns. Evaluation against commercial systems (Google Drive, OneDrive, Dropbox, Windows Search) reveals that all fail on memory-based queries, returning overwhelming result sets without contextual filtering. In contrast, Indaleko successfully processes multi-dimensional queries combining time, location, and activity patterns. The extensible architecture supports rapid integration of new data sources (10 minutes to 10 hours per provider) while preserving privacy through UUID-based semantic decoupling. The UPI's architectural synthesis bridges cognitive theory with distributed systems design, as demonstrated through the Indaleko prototype and rigorous evaluation. This work transforms personal information retrieval from keyword matching to memory-aligned finding, providing immediate benefits for existing data while establishing foundations for future context-aware systems.

Paper Structure (337 sections, 2 equations, 13 figures, 21 tables)

Figures (13)

  • Figure 1: Result overlaps for the query "Anth 394" across cloud platforms (Dropbox, Google Drive, OneDrive). Note: This diagram shows overlaps among precision-focused results, that is, files the data owner identified as actually relevant to the query (5 Dropbox, 12 Google Drive, 7 OneDrive files). This represents the subset of platform results that users would consider correct, highlighting how few relevant files overlap between platforms despite searching identical datasets. For recall-focused results (all files returned by platforms), see Figure 2.
  • Figure 2: UpSet plot of search results for the query "Anth 394" across Dropbox, Google Drive, OneDrive, and iCloud (via Finder) in December 2024. Top bars show all files each API returned (recall-focused counts: 15, 21, 16, 34). The matrix with vertical bars encodes intersections, scaling better than a Venn diagram for more than three datasets and exposing asymmetries (unique vs. shared files). Compared with Figure 1 (precision-focused, user-judged relevant files), this highlights the precision gap: platforms return 2 to 5× more items than users deem relevant. Sparse overlap plus high per-platform noise reveals reliance on platform-specific keyword matching instead of memory-aligned retrieval that would link "Anth 394" to related forensic anthropology materials via semantic, temporal, and episodic cues.
  • Figure 3: Architecture of the Unified Personal Index showing the core components and data flow. From the top, data arrives from a heterogeneous set of sources, including storage services, semantic transducers, and activity stream providers. Related state is formed into a memory anchor, which may be linked back to the processed data. Raw data is preserved and normalized into the index layer. This system then provides query support against the index, including providing current dynamic information about what normalized data is available from the UPI.
  • Figure 4: Indaleko implementation diagram showing: (1) ArangoDB database and collections; (2) Local file systems (NTFS, APFS, EXT4); (3) Cloud storage services (iCloud, Dropbox, Google Drive, OneDrive); (4) Activity stream sources (Location, Email, Music, Collaboration, Ambient, Query, Storage). Memory anchors permit creating relationships across these diverse data sources, which allows knowledge-graph construction over time.
  • Figure 5: Indaleko Collector/Recorder Pipeline: The diagram illustrates the modular data ingestion architecture of the Unified Personal Index. Collectors gather data from diverse sources with source-specific formats, while recorders normalize and persist this data with semantic UUID mappings. Three communication patterns (direct transfer, queue-based, and batch file) enable flexible integration. Red dashed lines indicate which communication pattern each collector typically uses. The data provider pattern (purple-green dashed outline) shows where collectors and recorders are tightly coupled for certain sources. All normalized data is stored in ArangoDB with both the original compressed content and semantically-mapped fields.
  • ...and 8 more figures
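
The collector/recorder split described in the Figure 5 caption can be illustrated with a minimal sketch. This is not Indaleko's actual code: the function names (`collect`, `record`, `semantic_uuid`, `decode_raw`), the namespace UUID, the field labels, and the record layout are all hypothetical, chosen only to show the direct-transfer pattern, UUID-based semantic mapping, and preservation of the original content in compressed form. Persisting the result to ArangoDB is omitted.

```python
import json
import uuid
import zlib
from typing import Dict, Iterable, Iterator

# Hypothetical namespace; the real system's UUID assignment is not shown in the source.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def semantic_uuid(label: str) -> str:
    """Map a semantic label to a stable UUID so stored keys do not reveal meaning."""
    return str(uuid.uuid5(NAMESPACE, label))


def collect(source_rows: Iterable[Dict[str, object]]) -> Iterator[Dict[str, object]]:
    """Collector: yield records in the source's own format (direct-transfer pattern)."""
    for row in source_rows:
        yield dict(row)


def record(raw: Dict[str, object], field_map: Dict[str, str]) -> Dict[str, object]:
    """Recorder: keep the compressed original and add UUID-keyed normalized fields.

    field_map maps source-specific field names to semantic labels (assumed labels).
    """
    payload = json.dumps(raw, sort_keys=True).encode("utf-8")
    attributes = {
        semantic_uuid(label): raw[source_field]
        for source_field, label in field_map.items()
        if source_field in raw
    }
    return {"raw": zlib.compress(payload), "attributes": attributes}


def decode_raw(doc: Dict[str, object]) -> Dict[str, object]:
    """Recover the original source record from the compressed copy."""
    return json.loads(zlib.decompress(doc["raw"]).decode("utf-8"))
```

A collector's output would be passed through `record` with a per-source `field_map`, for example `{"fname": "file_name", "mtime": "modified_time"}`. Only a holder of the label-to-UUID mapping can interpret the stored attribute keys, which is one plausible reading of the "semantic decoupling" the abstract mentions.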