PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Petr Anokhin, Evgeny Burnaev, Nikita Semenov
TL;DR
This work tackles the challenge of long-horizon personalization for large language models by introducing a flexible external memory grounded in a knowledge graph. Built on the AriGraph framework, the system integrates semantic (object, thesis) and episodic memory with hyper-edges, enabling rich temporal and relational reasoning. It offers multiple retrieval algorithms (A*, WaterCircles, BeamSearch and hybrids) and demonstrates how memory configuration interacts with model scale to affect QA performance across DiaASQ, HotpotQA, and TriviaQA, including temporal and contradictory information. The results show that thesis memories are highly informative for small-to-medium models, while hybrid retrieval strategies provide stability for larger models, and that this graph-based memory framework can surpass GraphRAG in certain settings while remaining competitive with RAG baselines when appropriately tuned. The findings illuminate how structured memory and flexible retrieval can enhance personalized, context-aware reasoning at scale, and point to future directions in temporal filtering, distributed memory storage, and privacy-preserving retrieval.
Abstract
Personalizing language models by effectively incorporating user interaction history remains a central challenge in the development of adaptive AI systems. While large language models (LLMs) combined with Retrieval-Augmented Generation (RAG) have improved factual accuracy, they often lack structured memory and fail to scale in complex, long-term interactions. To address this, we propose a flexible external memory framework based on a knowledge graph that is constructed and updated automatically by the LLM itself. Building upon the AriGraph architecture, we introduce a novel hybrid graph design that supports both standard edges and two types of hyper-edges, enabling rich and dynamic semantic and temporal representations. Our framework also supports diverse retrieval mechanisms, including A*, water-circle traversal, beam search, and hybrid methods, making it adaptable to different datasets and LLM capacities. We evaluate our system on three benchmarks (TriviaQA, HotpotQA, and DiaASQ) and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task. Additionally, we extend the DiaASQ benchmark with temporal annotations and internally contradictory statements, showing that our system remains robust and effective in managing temporal dependencies and context-aware reasoning.
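To make the hybrid graph idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes object nodes connected by standard triple edges, plus two hyper-edge kinds (labeled here "thesis" and "episode") that each group several object nodes, and a simple beam search over the resulting neighborhood. The names `MemoryGraph`, `add_triple`, `add_hyper`, `neighbors`, `beam_search`, and the toy `score` function are all hypothetical; in a real system the score would likely come from embedding similarity to the question.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryGraph:
    # Standard edges: (subject, relation, object) triples between object nodes.
    triples: list = field(default_factory=list)
    # Hyper-edges: name -> (kind, member nodes); kind is "thesis" or "episode" (assumed labels).
    hyper: dict = field(default_factory=dict)

    def add_triple(self, s, r, o):
        self.triples.append((s, r, o))

    def add_hyper(self, name, kind, nodes):
        self.hyper[name] = (kind, set(nodes))

    def neighbors(self, node):
        """Nodes one hop away via a standard edge or a shared hyper-edge."""
        out = set()
        for s, _, o in self.triples:
            if s == node:
                out.add(o)
            elif o == node:
                out.add(s)
        for _, members in self.hyper.values():
            if node in members:
                out |= members
        out.discard(node)
        return out


def beam_search(graph, start, score, width=2, depth=2):
    """Expand `depth` hops, keeping the `width` best-scoring new nodes per hop."""
    visited = {start}
    frontier = [start]
    for _ in range(depth):
        candidates = set()
        for n in frontier:
            candidates |= graph.neighbors(n)
        candidates -= visited
        # Sort by descending score; break ties alphabetically for determinism.
        frontier = sorted(candidates, key=lambda n: (-score(n), n))[:width]
        visited.update(frontier)
    return visited


# Toy memory: one standard edge, one thesis hyper-edge, one episodic hyper-edge.
g = MemoryGraph()
g.add_triple("Alice", "owns", "phone")
g.add_hyper("t1", "thesis", ["phone", "battery"])
g.add_hyper("e1", "episode", ["Alice", "cafe"])
retrieved = beam_search(g, "Alice", score=len, width=1, depth=2)
```

Here the hyper-edges let a single hop reach every node that shares a thesis or an episode with the current node, which is one plausible way a graph of this kind could expose relational and temporal context to the retriever.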
