Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception

Junxiao Shen, John Dudley, Per Ola Kristensson

TL;DR

The paper proposes a memory augmentation agent that encodes egocentric video as natural language and stores the encodings in a vector database. In a user study, the agent achieved significantly better recall on episodic memory tasks than human participants.

Abstract

We depend on our own memory to encode, store, and retrieve our experiences. However, memory lapses can occur. One promising avenue for achieving memory augmentation is through the use of augmented reality head-mounted displays to capture and preserve egocentric videos, a practice commonly referred to as lifelogging. However, a significant challenge arises from the sheer volume of video data generated through lifelogging, as current technology lacks the capability to encode and store such large amounts of data efficiently. Further, retrieving specific information from extensive video archives requires substantial computational power, which complicates the task of quickly accessing desired content. To address these challenges, we propose a memory augmentation agent that encodes video data as natural language and stores the resulting encodings in a vector database. This approach harnesses the power of large vision language models to perform the language encoding process. Additionally, we propose using large language models to facilitate natural language querying. Our agent underwent extensive evaluation using the QA-Ego4D dataset and achieved state-of-the-art results with a BLEU score of 8.3, outperforming conventional machine learning models that scored between 3.4 and 5.8. Additionally, we conducted a user study in which participants interacted with the human memory augmentation agent through episodic memory and open-ended questions. The results of this study show that the agent achieves significantly better recall performance on episodic memory tasks than human participants. The results also highlight the agent's practical applicability and user acceptance.
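
The encode-store-retrieve flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the captions are hand-written stand-ins for the output of a vision language model, the bag-of-words `embed` function is a toy substitute for a learned text encoder, and `MemoryStore` is a hypothetical in-memory replacement for a vector database. The final step, in which a large language model would compose an answer from the retrieved captions, is omitted.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a learned text encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MemoryStore:
    """Hypothetical in-memory stand-in for a vector database."""
    entries: list = field(default_factory=list)  # (timestamp_s, caption, vector)

    def encode_and_store(self, timestamp_s: float, caption: str) -> None:
        # Encode: the caption is the language encoding of a video segment.
        # Store: keep the caption together with its vector and timestamp.
        self.entries.append((timestamp_s, caption, embed(caption)))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Retrieve: rank stored captions by similarity to the query.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(t, c) for t, c, _ in ranked[:k]]


store = MemoryStore()
store.encode_and_store(12.0, "I put the keys on the kitchen counter")
store.encode_and_store(95.0, "I open the fridge and take out milk")
store.encode_and_store(240.0, "I sit on the sofa and read a book")

top = store.retrieve("Where did I put the keys?", k=1)
print(top)
```

Storing short captions rather than raw frames is what keeps both storage and search cheap in this sketch: the query only has to be compared against small text vectors, and the timestamp links each match back to the original moment in the video.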

Paper Structure (30 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The Egocentric Vision-Language Model is developed through a process called fine-tuning. This process involves extracting knowledge from a large model and transferring it to smaller models, resulting in improved accuracy and faster inference times. The Egocentric Vision-Language Model combines the power of vision and language to effectively process and understand egocentric video data. 13B and 7B refer to large language models with 13 billion and 7 billion parameters.
  • Figure 2: An example QA pair from the QA-Ego4D dataset, adapted from Bärmann and Waibel (2022).
  • Figure 3: The settings of the different scenarios and the duration of each scenario.
  • Figure 4: Comparative analysis of scores for the memory augmentation agent (AI) and human responses across various questions. Each question has multiple pairs of AI and human scores represented by the bars. The $x$-axis enumerates the questions, while the $y$-axis shows the scores, ranging from 1 to 5. The bars are color-coded, with one color representing AI scores and another representing human scores. The legend in the top-right corner, outside the plot area, distinguishes between AI and human bars.
  • Figure 5: The five-point Likert responses to the post-study questionnaire. Q1. The memory augmentation capability is valuable; Q2. The information provided by the memory augmentation agent is accurate; Q3. The response to my open-ended question by the memory augmentation agent is creative; Q4. I am willing to wear an always-on camera for memory augmentation through language encoding; Q5. I am willing for others in my close vicinity to wear an always-on camera for memory augmentation through language encoding.