An Egocentric Vision-Language Model based Portable Real-time Smart Assistant

Yifei Huang; Jilan Xu; Baoqi Pei; Yuping He; Guo Chen; Mingfang Zhang; Lijin Yang; Zheng Nie; Jinyao Liu; Guoshun Fan; Dechen Lin; Fang Fang; Kunpeng Li; Chang Yuan; Xinyuan Chen; Yaohui Wang; Yali Wang; Yu Qiao; Limin Wang

TL;DR

This work tackles the challenge of real-time, on-device egocentric AI by introducing Vinci, a hardware-agnostic vision-language system. At its core is EgoVideo-VL, a multimodal model that fuses an egocentric vision foundation model with a large language model, augmented by a memory module, a generation module for visual demonstrations, and a retrieval module bridging egocentric and third-person content. The authors train EgoVideo-VL on curated egocentric datasets using two-stage instruction fine-tuning and LoRA; the added memory, generation, and retrieval modules enable contextual chatting, temporal grounding, summarization, future planning, action prediction, and video retrieval. Evaluations, comprising benchmarks of EgoVideo-VL and extensive in-situ user studies, demonstrate Vinci’s effectiveness in real-world, portable scenarios and its potential for on-device egocentric AI in learning, assistance, and task guidance. Vinci is released as open source, enabling broader adoption and further development of portable, real-time egocentric AI systems.

Abstract

We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. All of Vinci's code, including the frontend, backend, and models, is available at https://github.com/OpenGVLab/vinci.
