Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Yifei Huang; Jilan Xu; Baoqi Pei; Yuping He; Guo Chen; Lijin Yang; Xinyuan Chen; Yaohui Wang; Zheng Nie; Jinyao Liu; Guoshun Fan; Dechen Lin; Fang Fang; Kunpeng Li; Chang Yuan; Yali Wang; Yu Qiao; Limin Wang

TL;DR

Vinci tackles real-time, context-aware assistance from egocentric video on portable devices by integrating an end-to-end EgoVideo-VL-based pipeline with memory, retrieval, and generation capabilities. The system processes live video/audio streams, maintains historical context for temporal grounding, and provides visual demonstrations alongside textual responses, enabling hands-free practical guidance. Key contributions include the EgoVideo-VL instruction-tuning pipeline over Ego4D/EgoExoLearn/Ego4D-Goalstep, a memory module for temporal grounding, a SEINE-based visual demonstration generator, and a retrieval module for external how-to videos, all released as open-source deployment code. This work demonstrates robust real-time performance across current scene understanding, temporal grounding, video summarization, future planning, and visual task demonstrations, offering a practical foundation for portable egocentric AI applications.

Abstract

We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. We release the complete implementation for on-device deployment, together with a demo web platform for testing uploaded videos, at https://github.com/OpenGVLab/vinci.
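To make the idea of answering queries about historical context concrete, here is a minimal sketch of a timestamped memory store in the spirit of the memory module described in the paper's Figure 2 caption (descriptions saved with their timestamps). Everything in it is an assumption for illustration: the class names `MemoryEntry` and `TimestampedMemory`, the `add` and `search` methods, and especially the word-overlap scoring, which is a simple stand-in and not the retrieval method used by Vinci. The sample descriptions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryEntry:
    """One stored observation (hypothetical structure, not Vinci's)."""
    timestamp: float   # seconds since the stream started
    description: str   # text description of a video segment


@dataclass
class TimestampedMemory:
    """Illustrative store of timestamped descriptions."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, timestamp: float, description: str) -> None:
        # Append a new description together with its timestamp.
        self.entries.append(MemoryEntry(timestamp, description))

    def search(self, query: str, top_k: int = 1) -> List[MemoryEntry]:
        # Assumption: score by shared lowercase words. A real system
        # would more likely use a language model or embeddings.
        query_words = set(query.lower().split())
        scored = []
        for entry in self.entries:
            entry_words = set(entry.description.lower().split())
            overlap = len(query_words & entry_words)
            if overlap > 0:
                scored.append((overlap, entry.timestamp, entry))
        # Highest overlap first; ties go to the most recent entry.
        scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
        return [entry for _, _, entry in scored[:top_k]]


# Usage with made-up descriptions.
memory = TimestampedMemory()
memory.add(35.2, "the user is cutting carrots on a board")
memory.add(94.7, "the user holds one egg in hand")
for hit in memory.search("when did I cut the carrots"):
    print(f"{hit.timestamp:.1f}s: {hit.description}")
```

Keeping the timestamp next to each description is what lets a query such as "when did I ..." be answered with a point in the stream rather than only with text.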

Paper Structure (25 sections, 14 figures)

This paper contains 25 sections, 14 figures.

Introduction
Related Work
Egocentric vision
Vision-language models
Streaming video understanding
Method
Input processing
EgoVideo-VL
Memory module
Generation module
Retrieval module
System integration
Experiments
Qualitative analysis
Current scene understanding
...and 10 more sections

Figures (14)

Figure 1: Overview of Vinci’s capabilities demonstrated through a streaming video timeline. At different timestamps, Vinci showcases its diverse real-time abilities: (1) Current scene understanding—providing detailed analysis of the ongoing activity and environment; (2) Temporal grounding—retrieving and referencing past events based on user queries; (3) Video summarization—offering concise summaries of key actions over time; (4) Future planning—predicting upcoming steps or actions based on historical context and current observations; and (5) Action prediction—generating a visual demonstration of the next likely action to assist users in task completion.
Figure 2: The overall structure of the EgoVideo-VL model. The visual encoder leverages the egocentric video foundation model, EgoVideo, while the LLM component utilizes InternLM. The memory module periodically processes video content, saving detailed descriptions and corresponding timestamps for historical context. The generation module predicts actions or creates visual demonstrations based on the current video frame and user prompts. The retrieval module retrieves third-person perspective videos, enabling users to watch and imitate skill-related tasks.
Figure 3: Overview of the Vinci system. The system integrates four components: the camera, frontend, backend, and models. The frontend is a web-based interface that displays the video stream and plays audio generated by TTS from the model's responses. The backend acts as the central hub, managing communication between the frontend, camera, and models. It listens for wake-up commands and, upon detection, activates the EgoVideo-VL model to process user prompts and deliver the corresponding outputs.
Figure 4: Real-world deployment of Vinci. (a) The system deployed on a OnePlus smartphone mounted on the user's head. (b) The web-based frontend. The displayed model output is also played as audio for seamless interaction.
Figure 5: Example of Vinci’s ability to analyze the current video state and accurately respond to user queries. In this scenario, at 35.2 seconds, Vinci correctly identifies the ongoing action as cutting carrots. At 94.7 seconds, Vinci accurately recognizes the current scene, confirming that only one egg is being held in hand.
...and 9 more figures
