TeleEgo: Benchmarking Egocentric AI Assistants in the Wild

Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang, Yiheng Wang, Xinlin Zhong, Yun Wang, Long Zhang, Xiangyu Chen, Changzhi Sun, Jixiang Luo, Dell Zhang, Hao Sun, Chi Zhang, Xuelong Li

TL;DR

TeleEgo addresses the gap in evaluating egocentric AI assistants under realistic, long-duration streaming with omni-modal inputs. It provides a long-duration, synchronized dataset (over 14 hours per participant) across four domains, with 3,291 QA items and 12 diagnostic subtasks spanning Memory, Understanding, and Cross-Memory Reasoning. The paper introduces Real-Time Accuracy (RTA) and Memory Persistence Time (MPT) as metrics to capture correctness, timing, and long-term memory in continuous streams, and presents evaluation protocols and baseline results for current models. TeleEgo offers an extensible benchmark for research on real-time behavior and long-horizon memory in first-person AI assistants.

Abstract

Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose Real-Time Accuracy (RTA) to jointly capture correctness and responsiveness under tight decision windows, and Memory Persistence Time (MPT) as a forward-looking metric for long-term retention in continuous streams. In this work, we report RTA results for current models and release TeleEgo, together with an MPT evaluation framework, as a realistic and extensible benchmark for future egocentric assistants with stronger streaming memory, enabling systematic study of both real-time behavior and long-horizon memory.

Paper Structure

This paper contains 23 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: An overview of the TeleEgo project. Top-left: Scripted real-world wearable-camera scenarios covering multiple roles, scenes, and themes. Top-right: Omni-modal egocentric streaming data aligned to a shared timeline, comprising video, audio, and human-curated speech transcripts and visual narrations. Bottom-left: Online multitask QA benchmark organized across three capability dimensions (Memory, Understanding, Cross-Memory Reasoning), containing 3,291 QA items across 4 question types. Bottom-right: Long-term streaming video pipeline, showing egocentric footage with query-time retrieval spanning seconds to days.
  • Figure 2: Scenario design and activity distribution in the TeleEgo dataset. Each role engaged in diverse first-person activities across three recording days, systematically covering four themes: Work & Study, Lifestyle & Routines, Social Activities, and Outings & Culture. The design spans a wide spectrum of cognitive and social contexts, combining solo and multi-role interactions across indoor and outdoor environments. This structure ensures ecological diversity and supports analyses of long-term, cross-situational understanding.
  • Figure 3: TeleEgo construction pipeline. Step 1: egocentric video capture across 5 roles, 4 themes and 3 days. Step 2: data processing into synchronized video, audio, speech, and narration captions. Step 3: AI tools generate candidate QA items from task descriptions and captions, followed by human verification.
  • Figure 4: Hierarchical organization of the TeleEgo benchmark. The benchmark is organized around three cognitive dimensions: Memory, Understanding, and Cross-Memory Reasoning. Each dimension is further divided into fine-grained subcategories.
  • Figure 5: The TeleEgo benchmark comprises twelve QA subcategories, illustrated here with one example per subcategory.
  • ...and 1 more figures