OmniQuery: Contextually Augmenting Captured Multimodal Memory to Enable Personal Question Answering

Jiahao Nick Li, Zhuohao Jerry Zhang, Jiaju Ma

TL;DR

OmniQuery tackles the challenge of answering complex, memory-based questions over personal multimodal captures by introducing a taxonomy-driven memory augmentation pipeline and an LLM-grounded QA system. Grounded in a diary-study-derived taxonomy of atomic context, composite context, and semantic knowledge, the approach augments memories across multiple related items and retrieves relevant augmented data to generate answers with references. In a user study, OmniQuery outperformed a baseline retrieval-augmented system, achieving 71.5% accuracy and winning or tying in 74.5% of comparisons, demonstrating stronger capability for handling hybrid and context-rich queries. The work highlights practical implications for private, multimodal memory assistants, while outlining future work on multilingual and multimodal interaction, privacy, and benchmarking. The contributions include a taxonomy of contextual information, a three-stage augmentation pipeline, an end-to-end QA system, and an empirical evaluation supporting effectiveness against a baseline.

Abstract

People often capture memories through photos, screenshots, and videos. While existing AI-based tools enable querying this data using natural language, they only support retrieving individual pieces of information, such as certain objects in photos, and struggle to answer more complex queries that involve interpreting interconnected memories, such as sequential events. We conducted a one-month diary study to collect realistic user queries and generated a taxonomy of the contextual information needed to integrate with captured memories. We then introduce OmniQuery, a novel system that can answer complex personal memory-related questions that require extracting and inferring contextual information. OmniQuery augments individual captured memories by integrating scattered contextual information from multiple interconnected memories. Given a question, OmniQuery retrieves relevant augmented memories and uses a large language model (LLM) to generate answers with references. In human evaluations, we show the effectiveness of OmniQuery with an accuracy of 71.5%, outperforming a conventional RAG system by winning or tying 74.5% of the time.

Paper Structure (60 sections, 1 equation, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Number of appearances of each type of context (atomic and composite) in the logged queries. Note that a single query may involve multiple types of context, as in "What boba tea did I drink last week?"
  • Figure 1: Structure of the baseline implementation.
  • Figure 2: Augmenting captured memories involves three steps: (1) structuring memories by processing content and annotating with atomic contexts; (2) identifying composite context through sliding windows; (3) inferring semantic knowledge from the structured memories and identified contexts.
  • Figure 2: Four exemplar failure cases: (a) lack of context, (b) wording ambiguity, (c) information loss during processing and (d) redundancy-induced failure.
  • Figure 3: An example of structuring an individual captured memory (a photo of the Wi-Fi details of CHI 2024 conference).
  • ...and 5 more figures
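
The second augmentation step named in the Figure 2 caption above (identifying composite context through sliding windows) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Memory` record, the window size and step, and especially `shared_location`, a simple stand-in heuristic for whatever component actually decides that a window of memories belongs to one composite context. It is not the authors' implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Optional


@dataclass(frozen=True)
class Memory:
    """Hypothetical record: one captured item plus its atomic context."""
    memory_id: str
    captured_at: datetime
    location: str
    caption: str


def sliding_windows(memories: list[Memory], size: int, step: int) -> Iterator[list[Memory]]:
    """Yield overlapping windows over the memories in chronological order."""
    ordered = sorted(memories, key=lambda m: m.captured_at)
    # If there are fewer memories than `size`, yield one shorter window.
    last_start = max(len(ordered) - size + 1, 1)
    for start in range(0, last_start, step):
        yield ordered[start:start + size]


def shared_location(window: list[Memory]) -> Optional[str]:
    """Stand-in heuristic (assumption): a window forms a composite context
    only if every memory in it was captured at the same location."""
    locations = {m.location for m in window}
    return next(iter(locations)) if len(locations) == 1 else None


def composite_candidates(memories: list[Memory], size: int = 3, step: int = 1) -> dict[str, list[str]]:
    """Merge the windows that pass the heuristic into candidate composite
    contexts, keyed by location, each listing the member memory ids."""
    candidates: dict[str, set[str]] = {}
    for window in sliding_windows(memories, size, step):
        location = shared_location(window)
        if location is not None:
            candidates.setdefault(location, set()).update(m.memory_id for m in window)
    return {loc: sorted(ids) for loc, ids in candidates.items()}
```

Because the windows overlap, a run of related memories longer than one window is still merged into a single candidate, while a short run that never fills a whole window is left out. The window size therefore sets the minimum length of run that can be detected in this sketch.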