An Egocentric Vision-Language Model based Portable Real-time Smart Assistant

Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Mingfang Zhang, Lijin Yang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Xinyuan Chen, Yaohui Wang, Yali Wang, Yu Qiao, Limin Wang

TL;DR

This work tackles the challenge of real-time, on-device egocentric AI by introducing Vinci, a hardware-agnostic vision-language system. At its core is EgoVideo-VL, a multimodal model that fuses an egocentric vision foundation model with a large language model, augmented by a memory module, a generation module for visual demonstrations, and a retrieval module bridging egocentric and third-person content. The authors train EgoVideo-VL on curated egocentric datasets via two-stage instruction fine-tuning and LoRA, while adding memory, generation, and retrieval capabilities to enable contextual chatting, temporal grounding, summarization, future planning, action prediction, and video retrieval. The evaluation, comprising benchmarks on EgoVideo-VL and in-situ user studies, demonstrates Vinci’s effectiveness in real-world, portable scenarios and its potential for learning, assistance, and task guidance. Vinci is released as open source to support broader adoption and further development of portable, real-time egocentric AI systems.

Abstract

We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. All code for Vinci, including the frontend, backend, and models, is available at https://github.com/OpenGVLab/vinci.

Paper Structure

This paper contains 29 sections, 3 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: (a) Vinci's hardware-agnostic deployment across various devices, including smart glasses, wearable cameras, and smartphones. (b) Vinci's six core functionalities: contextual chatting for interactive queries, temporal grounding for reasoning over past events, summarization for concise activity overviews, planning for multi-step task execution, and action prediction and video retrieval for visual skill demonstrations.
  • Figure 2: Results of the pre-experiment survey. Pie charts (a-d) show the demographics of the survey participants, where (a) is the age distribution, (b) is the familiarity with smart assistants, (c) is the gender, and (d) is the occupation. Bar plot (e) shows the participants' expectations of what features an egocentric smart assistant should have. Bar plot (f) is the result of the participants' thoughts on what aspect of an egocentric smart assistant is the most appealing.
  • Figure 3: Overview of the Vinci system. The left side illustrates Vinci’s system architecture, comprising four key components: (1) Input Processing Module, which receives live video streams and user queries; (2) Backend, which manages communication, query processing, and wake-up keyword detection; (3) EgoVideo-VL Model, which integrates egocentric vision with language understanding for real-time multimodal reasoning; and (4) Frontend, which delivers responses via text, speech, or visual demonstrations. The right side shows examples of real-world hardware deployments, demonstrating Vinci’s versatility across smartphones and wearable cameras for seamless, context-aware assistance in dynamic environments.
  • Figure 4: Overview of the EgoVideo-VL model. EgoVideo-VL is a multimodal vision-language model designed for real-time egocentric understanding and assistance. The model comprises five key components: (1) Modality Encoder, which follows the design of EgoVideo (Pei et al., 2024) and includes a video encoder and a text encoder for multimodal feature extraction; (2) Memory Module, which stores historical context to enable temporal grounding, summarization, and personalized interactions; (3) Large Language Model (LLM), which performs multimodal reasoning and response generation; (4) Generation Module, which synthesizes visual action predictions to guide users through tasks; and (5) Retrieval Module, which retrieves third-person expert demonstrations to complement egocentric understanding.
  • Figure 5: (a) Example of Vinci operating on a head-mounted smartphone. (b) The web-based user interface of Vinci, displaying the live video stream, conversation history, and audio playback.
  • ...and 8 more figures
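
The Memory Module in the Figure 4 caption is described only at a high level: it stores historical context so that the model can ground questions in past events and summarize activity. The Python sketch below is one way such a store could look, assuming the history is kept as timestamped text descriptions in a bounded buffer. The names `MemoryEntry`, `StreamMemory`, `context`, and `find` are hypothetical and are not taken from the Vinci codebase, and the keyword lookup is a simple stand-in for whatever grounding the actual model performs.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class MemoryEntry:
    timestamp: float   # seconds since the start of the stream
    description: str   # text description of one video clip (assumed format)


class StreamMemory:
    """Bounded history of clip descriptions (illustrative, not Vinci's code)."""

    def __init__(self, max_entries: int = 100) -> None:
        # A deque with maxlen drops the oldest entry once the buffer is full,
        # so memory use stays constant on an arbitrarily long stream.
        self._entries: Deque[MemoryEntry] = deque(maxlen=max_entries)

    def add(self, timestamp: float, description: str) -> None:
        self._entries.append(MemoryEntry(timestamp, description))

    def context(self, last_n: Optional[int] = None) -> str:
        """Render the history (or its last_n entries) as text for a prompt."""
        entries = list(self._entries)
        if last_n is not None:
            entries = entries[-last_n:] if last_n > 0 else []
        return "\n".join(f"[{e.timestamp:.1f}s] {e.description}" for e in entries)

    def find(self, keyword: str) -> List[MemoryEntry]:
        """Return entries mentioning keyword (case-insensitive substring match)."""
        kw = keyword.lower()
        return [e for e in self._entries if kw in e.description.lower()]


memory = StreamMemory(max_entries=3)
memory.add(0.0, "picks up a knife")
memory.add(4.0, "cuts an onion")
memory.add(8.0, "puts the keys on the table")
memory.add(12.0, "turns on the stove")  # buffer is full: the 0.0s entry is evicted

hits = memory.find("keys")              # "where did I leave my keys?"
print(hits[0].timestamp)
print(memory.context(last_n=2))
```

A fixed-size buffer is the simplest way to keep an unbounded stream within a fixed budget; a real system might instead compress or summarize older entries rather than discard them.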