Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception

Junxiao Shen; John Dudley; Per Ola Kristensson

TL;DR

A memory augmentation agent is proposed that encodes egocentric video as natural language and stores the encodings in a vector database; it achieves significantly better recall on episodic memory tasks than human participants.

Abstract

We depend on our own memory to encode, store, and retrieve our experiences, yet memory lapses occur. One promising avenue for memory augmentation is to use augmented reality head-mounted displays to capture and preserve egocentric videos, a practice commonly referred to as lifelogging. However, lifelogging generates a sheer volume of video data that current technology cannot encode and store efficiently. Retrieving specific information from extensive video archives also requires substantial computational power, which makes it harder to access the desired content quickly. To address these challenges, we propose a memory augmentation agent that encodes video data in natural language and stores the encodings in a vector database. The language encoding is performed by large vision language models, and large language models are used to support natural language querying. We evaluated the agent extensively on the QA-Ego4D dataset, where it achieved state-of-the-art results with a BLEU score of 8.3, outperforming conventional machine learning models that scored between 3.4 and 5.8. We also conducted a user study in which participants interacted with the memory augmentation agent through episodic memory and open-ended questions. The results show that the agent achieves significantly better recall performance on episodic memory tasks than human participants, and they highlight the agent's practical applicability and user acceptance.
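The encode-store-retrieve loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, `MemoryStore` is a hypothetical in-memory substitute for a vector database, and the captions are made-up examples of what a vision language model might produce. The final step, in which a large language model composes an answer from the retrieved captions, is omitted.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MemoryStore:
    """Hypothetical in-memory substitute for a vector database."""
    entries: list = field(default_factory=list)

    def encode_and_store(self, timestamp_s: float, caption: str) -> None:
        # Encode: the caption is assumed to come from a vision language model.
        # Store: keep the timestamp, the caption, and its vector together.
        self.entries.append((timestamp_s, caption, embed(caption)))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Retrieve: rank stored captions by similarity to the query.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(t, c) for t, c, _ in ranked[:k]]


store = MemoryStore()
store.encode_and_store(12.0, "I put the keys on the kitchen counter")
store.encode_and_store(95.0, "I poured coffee into a blue mug")
store.encode_and_store(240.0, "I opened the laptop on the desk")

top = store.retrieve("Where did I put my keys?", k=1)
print(top)
```

Storing short text captions instead of raw frames is what keeps both the storage footprint and the retrieval cost small in this sketch; a real system would replace `embed` with a learned embedding model and the list scan with an indexed vector search.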

Paper Structure (30 sections, 5 figures, 2 tables)


Introduction
Related Work
Augmented Reality: Augmenting User Perceptions
Memory Augmentation through Lifelogging
Video Content Analysis
Agent Design
Encode
Store and Retrieve
Study 1: Large-Scale Evaluation of the Memory Augmentation Agent
Dataset - QA-Ego4D
Baseline Models
Evaluation Metrics
Results
Study 2: Usability study for open-ended questions
Methodology
...and 15 more sections

Figures (5)

Figure 1: The Egocentric Vision-Language Model is developed through a process called fine-tuning. This process involves extracting knowledge from a large model and transferring it to smaller models, resulting in improved accuracy and faster inference times. The Egocentric Vision-Language Model combines the power of vision and language to effectively process and understand egocentric video data. 13B and 7B refer to large language models with 13 billion and 7 billion parameters.
Figure 2: An example QA pair from the QA-Ego4D dataset, adapted from Bärmann and Waibel (2022).
Figure 3: The settings of the different scenarios and the duration of each scenario.
Figure 4: Comparative analysis of scores for the memory augmentation agent and human responses across various questions. Each question has multiple pairs of AI and human scores represented by the bars. The $x$-axis enumerates the questions, while the $y$-axis shows the scores, ranging from 1 to 5. The bars are color-coded, with one color for AI scores and another for human scores. The legend in the top-right corner, outside the plot area, distinguishes between AI and human bars.
Figure 5: The five-point Likert responses to the post-study questionnaire. Q1. The memory augmentation capability is valuable; Q2. The information provided by the memory augmentation agent is accurate; Q3. The response to my open-ended question by the memory augmentation agent is creative; Q4. I am willing to wear an always-on camera for memory augmentation through language encoding; Q5. I am willing for others in my close vicinity to wear an always-on camera for memory augmentation through language encoding.
