OmniQuery: Contextually Augmenting Captured Multimodal Memory to Enable Personal Question Answering
Jiahao Nick Li, Zhuohao Jerry Zhang, Jiaju Ma
TL;DR
OmniQuery tackles the challenge of answering complex, memory-based questions over personal multimodal captures by introducing a taxonomy-driven memory augmentation pipeline and an LLM-grounded QA system. Grounded in a diary-study-derived taxonomy of atomic context, composite context, and semantic knowledge, the approach augments each memory with context drawn from multiple related items and retrieves relevant augmented data to generate answers with references. In a human evaluation, OmniQuery achieved 71.5% accuracy and won or tied against a baseline retrieval-augmented system in 74.5% of comparisons, demonstrating stronger capability for handling hybrid and context-rich queries. The work highlights practical implications for private, multimodal memory assistants, while outlining future work on multilingual, multimodal interaction, privacy, and benchmarking. The contributions include a taxonomy of contextual information, a three-stage augmentation pipeline, an end-to-end QA system, and an empirical evaluation supporting effectiveness against a baseline.
Abstract
People often capture memories through photos, screenshots, and videos. While existing AI-based tools enable querying this data using natural language, they only support retrieving individual pieces of information, such as certain objects in photos, and struggle to answer more complex queries that involve interpreting interconnected memories, such as sequential events. We conducted a one-month diary study to collect realistic user queries and generated a taxonomy of the contextual information that needs to be integrated with captured memories. We then introduce OmniQuery, a novel system that answers complex personal memory-related questions that require extracting and inferring contextual information. OmniQuery augments individual captured memories by integrating scattered contextual information from multiple interconnected memories. Given a question, OmniQuery retrieves relevant augmented memories and uses a large language model (LLM) to generate answers with references. In human evaluations, we show the effectiveness of OmniQuery with an accuracy of 71.5%, outperforming a conventional RAG system by winning or tying 74.5% of the time.
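To make the augment-then-retrieve flow described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy `Memory` record, a composite context given as a hypothetical `(label, start_day, end_day)` tuple, and plain word overlap standing in for whatever retrieval method the system actually uses. The LLM call is omitted; the sketch only builds a prompt whose bracketed memory IDs let an answer cite its references.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Memory:
    """A captured item (photo, screenshot, video) reduced to text. Hypothetical schema."""
    memory_id: str
    caption: str
    day: int                                          # capture day, as a simple integer
    context: list[str] = field(default_factory=list)  # augmented context labels


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w}


def augment(memories: list[Memory], events: list[tuple[str, int, int]]) -> None:
    """Attach each event label to every memory captured within the event's day range."""
    for label, start_day, end_day in events:
        for memory in memories:
            if start_day <= memory.day <= end_day:
                memory.context.append(label)


def retrieve(question: str, memories: list[Memory], k: int = 3) -> list[Memory]:
    """Rank memories by word overlap between the question and caption plus context."""
    query = tokenize(question)
    scored = []
    for memory in memories:
        text = " ".join([memory.caption, *memory.context])
        score = len(query & tokenize(text))
        if score > 0:
            scored.append((score, memory))
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [memory for _, memory in ranked[:k]]


def build_prompt(question: str, retrieved: list[Memory]) -> str:
    """Format retrieved memories with IDs so a generated answer can reference them."""
    lines = [
        f"[{m.memory_id}] {m.caption} (context: {', '.join(m.context) or 'none'})"
        for m in retrieved
    ]
    return (
        "Answer using only the memories below and cite their IDs.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )


if __name__ == "__main__":
    memories = [
        Memory("m1", "photo of a poster session", day=2),
        Memory("m2", "dinner with friends", day=3),
        Memory("m3", "cat on sofa", day=10),
    ]
    augment(memories, [("CHI 2024 conference", 1, 4)])
    question = "What did I do during CHI 2024?"
    print(build_prompt(question, retrieve(question, memories, k=2)))
```

Neither caption for `m1` or `m2` mentions the conference, so without the augmentation step the question would match nothing; the propagated event label is what makes both memories retrievable.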
