Recording First-person Experiences to Build a New Type of Foundation Model

Dionis Barcari, David Gamez, Aliya Grig

TL;DR

The paper argues that current foundation models trained on Internet data may fail to capture real human behavior because they lack emotional and physiological grounding. It proposes a first-person recorder that collects environmental stimuli alongside multimodal signals (14-channel EEG, GSR, facial expressions) and processes this data with cloud analytics and a descriptive experience sampling (DES) mechanism to enable training of first-person foundation models (FPFMs). These FPFMs would map environmental inputs to emotional/physiological states and these states to external behavior, enabling more realistic personalization, dialog, and agent interfaces. The work discusses data requirements, potential applications (recommendations, dating, recruitment, GAN feedback, and actor-specific dialogue), privacy considerations, and a path toward startup funding to scale data collection and model training. If successful, FPFMs could significantly advance personalized AI that better reflects real human minds, though privacy and copyright challenges must be addressed before widespread deployment.

Abstract

Foundation models have had a big impact in recent years and billions of dollars are being invested in them in the current AI boom. The more popular ones, such as ChatGPT, are trained on large amounts of Internet data. However, it is becoming apparent that this data is likely to be exhausted soon, and technology companies are looking for new sources of data to train the next generation of foundation models. Reinforcement learning, RAG, prompt engineering and cognitive modelling are often used to fine-tune and augment the behaviour of foundation models. These techniques have been used to build chatbots that replicate people, such as Caryn Marjorie. These chatbots are not based on people's actual emotional and physiological responses to their environment, so they are, at best, a surface-level approximation of the characters they are imitating. To address these issues, we have developed a recording rig that captures what the wearer is seeing and hearing as well as their skin conductance (GSR), facial expression and brain state (14-channel EEG). AI algorithms are used to process this data into a rich picture of the environment and internal states of the subject. Foundation models trained on this data could replicate human behaviour much more accurately than the personality models that have been developed so far. This type of model has many potential applications, including recommendation, personal assistance, GAN systems, dating and recruitment. This paper gives some background to this work and describes the recording rig and preliminary tests of its functionality. It then suggests how a new type of foundation model could be created from the data captured by the rig and outlines some applications. Data gathering and model training are expensive, so we are currently working on the launch of a start-up that could raise funds for the next stage of the project.

Paper Structure

This paper contains 13 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: First-person recorder.
  • Figure 2: Architecture. The Raspberry Pi worn around the user's neck collects data from the camera, microphone and GSR sensor and sends it to a web service running on a laptop worn on the user's back. The Emotiv Epoc X EEG headset sends data to the laptop using the Emotiv Cortex API. A speaker attached to the Raspberry Pi emits a tone at random intervals for descriptive experience sampling (DES) [Hurlburt2006]. A data processing program running on the laptop collects the data and uses Amazon Web Services (AWS) to identify text, sentiment and image labels. When the data for a file is complete, a hash of the file contents is sent to another cloud web service, which stores the original hash and sends back a new hash that combines the original hash with a random number. A web interface is provided to configure the recorder and play back recorded data.
  • Figure 3: Web interface for recorder.
  • Figure 4: Mean arousal levels in experiments by Crone et al. [Crone2018] and arousal levels recorded from a subject while viewing a sample of images from the SMID data set.
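
The hash exchange described in the Figure 2 caption can be illustrated with a minimal sketch. The caption does not say which hash function is used, how the random number is combined with the original hash, or whether the service supports later verification, so SHA-256, the concatenate-then-hash step, and the names file_digest, HashRegistry, register and verify are all assumptions made for illustration. The in-memory class stands in for the cloud web service.

```python
import hashlib
import secrets


def file_digest(data: bytes) -> str:
    """Hash of the recorded file's contents (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class HashRegistry:
    """Hypothetical in-memory stand-in for the cloud web service in Figure 2."""

    def __init__(self) -> None:
        # Maps each original hash to the random number issued for it.
        self._records: dict[str, str] = {}

    @staticmethod
    def _combine(original_hash: str, nonce: str) -> str:
        # Assumed combination rule: hash the original hash together with the nonce.
        return hashlib.sha256((original_hash + nonce).encode("ascii")).hexdigest()

    def register(self, original_hash: str) -> str:
        """Store the original hash and return a new hash that includes a random number."""
        nonce = secrets.token_hex(16)
        self._records[original_hash] = nonce
        return self._combine(original_hash, nonce)

    def verify(self, data: bytes, receipt: str) -> bool:
        """Check that the data matches a stored hash and the receipt issued for it."""
        original_hash = file_digest(data)
        nonce = self._records.get(original_hash)
        if nonce is None:
            return False
        expected = self._combine(original_hash, nonce)
        return secrets.compare_digest(expected, receipt)


registry = HashRegistry()
recording = b"example recording bytes"
receipt = registry.register(file_digest(recording))
print(registry.verify(recording, receipt))          # unmodified file
print(registry.verify(recording + b"x", receipt))   # altered file
```

Under these assumptions, a recording that has been altered after registration no longer matches any stored hash, which is one plausible reason for sending only the hash, and not the recording itself, to the second service.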