On the Application of Egocentric Computer Vision to Industrial Scenarios

Vivek Chavan, Oliver Heimann, Jörg Krüger

TL;DR

The paper addresses the gap between industrial digitisation and modern AI by proposing egocentric vision, in which lightweight wearables collect multimodal data from a first-person perspective. It introduces a pipeline in which the user's natural-language observations guide the processing of synchronised video, eye-gaze and hand data to produce rich labels, aided by a language model that generates structured metadata; contextual cues such as trajectory and location complement the approach. Key contributions include a detailed automated data collection and labelling workflow and a federated, three-layer continual-learning framework that handles privacy and incremental updates across personal, organisational and global levels. The work aims to reduce digitisation effort, enhance knowledge transfer and enable context-aware models in industrial environments, with practical implications for operator guidance, defect labelling and workflow understanding.
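
The labelling idea summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: the `GazeSample` and `Utterance` types, the `label_from_utterance` helper and the fixed box size are all hypothetical, and the sketch assumes that the transcript and the gaze stream already share one clock. Each spoken observation becomes a weak label attached to a box around the mean gaze point recorded while the user was speaking.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class GazeSample:
    t: float  # timestamp in seconds (assumed shared clock)
    x: float  # normalised image coordinate in [0, 1]
    y: float  # normalised image coordinate in [0, 1]


@dataclass
class Utterance:
    start: float  # start of the transcribed observation, seconds
    end: float    # end of the transcribed observation, seconds
    text: str     # transcribed natural-language observation


def clamp(value: float) -> float:
    """Keep a normalised coordinate inside [0, 1]."""
    return min(1.0, max(0.0, value))


def label_from_utterance(utt: Utterance,
                         gaze: list[GazeSample],
                         half_size: float = 0.1) -> Optional[dict]:
    """Attach the utterance text to a box around the mean gaze point.

    Returns None when no gaze sample falls inside the utterance window.
    The fixed half_size is an illustrative assumption, not a tuned value.
    """
    window = [g for g in gaze if utt.start <= g.t <= utt.end]
    if not window:
        return None
    cx = mean(g.x for g in window)
    cy = mean(g.y for g in window)
    return {
        "text": utt.text,
        "start": utt.start,
        "end": utt.end,
        "box": (clamp(cx - half_size), clamp(cy - half_size),
                clamp(cx + half_size), clamp(cy + half_size)),
        "n_gaze_samples": len(window),
    }
```

A real system would also need to handle clock drift between sensors and use hand-interaction data to refine the region; the sketch only shows how a spoken observation and gaze can be combined into one label record.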

Abstract

Egocentric vision aims to capture and analyse the world from the first-person perspective. We explore the possibilities for egocentric wearable devices to improve and enhance industrial use cases w.r.t. data collection, annotation, labelling and downstream applications. This would contribute to easier data collection and allow users to provide additional context. We envision that this approach could serve as a supplement to the traditional industrial Machine Vision workflow. Code, Dataset and related resources will be available at: https://github.com/Vivek9Chavan/EgoVis24

Paper Structure (4 sections, 3 figures)

Figures (3)

  • Figure 1: A comparison of the two approaches. Our work explores the latter.
  • Figure 2: A summary of the proposed pipeline. The User/SME wearing the egocentric device interacts with the object/machinery and documents their observations in natural language. The multimodal dataset is then processed to obtain image/video data, while the transcription, eye gaze and hand interaction provide the labels and annotations, along with metadata. Top: Point cloud reconstruction example from a use case. Bottom: A conceptualisation of the data processing.
  • Figure 3: A summary of a distributed Continual Learning framework for egocentric applications. The three application layers are personal (top), organisational (middle) and global (bottom). The most sensitive information is stored and processed on the personal computing setup, which has limited compute. The organisational layer incrementally trains the local models, which receive user feedback and related data from the egocentric device. The global foundational models require large amounts of data, which the organisation could share periodically (after anonymisation and review).
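
The data flow between the three layers described in the Figure 3 caption can be sketched roughly as follows. This is an illustrative assumption rather than the paper's code: the `Record` fields and the `to_organisation`, `anonymise` and `to_global` helpers are hypothetical names, and real anonymisation would involve far more than dropping a user identifier.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Record:
    user_id: Optional[str]    # identity of the wearer, None once anonymised
    label: str                # label or feedback derived from the recording
    contains_sensitive: bool  # hypothetical flag, e.g. faces or voices
    reviewed: bool = False    # set once the organisation has reviewed it


def to_organisation(records: list[Record]) -> list[Record]:
    """Personal -> organisational: sensitive records stay on the personal setup."""
    return [r for r in records if not r.contains_sensitive]


def anonymise(record: Record) -> Record:
    """Placeholder anonymisation: only strips the user identity."""
    return replace(record, user_id=None)


def to_global(records: list[Record]) -> list[Record]:
    """Organisational -> global: share only reviewed records, anonymised."""
    return [anonymise(r) for r in records if r.reviewed]


if __name__ == "__main__":
    personal = [
        Record("op-1", "loose bolt", contains_sensitive=False, reviewed=True),
        Record("op-1", "colleague in frame", contains_sensitive=True),
        Record("op-2", "paint defect", contains_sensitive=False),
    ]
    organisational = to_organisation(personal)
    shared = to_global(organisational)
    print(len(organisational), len(shared))
```

Modelling each hand-over as a pure filter makes the privacy boundary explicit: a record can only reach a wider layer by passing the check of every layer before it.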