'What did the Robot do in my Absence?' Video Foundation Models to Enhance Intermittent Supervision

Kavindie Katuwandeniya, Leimin Tian, Dana Kulić

TL;DR

Query-driven summaries significantly improve retrieval accuracy compared to generic summaries or raw data, albeit with increased task duration, and storyboards are found to be the most effective presentation modality, especially for object-related queries.

Abstract

This paper investigates the application of Video Foundation Models (ViFMs) for generating robot data summaries to enhance intermittent human supervision of robot teams. We propose a novel framework that produces both generic and query-driven summaries of long-duration robot vision data in three modalities: storyboards, short videos, and text. Through a user study involving 30 participants, we evaluate the efficacy of these summary methods in allowing operators to accurately retrieve the observations and actions that occurred while the robot was operating without supervision over an extended duration (40 min). Our findings reveal that query-driven summaries significantly improve retrieval accuracy compared to generic summaries or raw data, albeit with increased task duration. Storyboards are found to be the most effective presentation modality, especially for object-related queries. This work represents, to our knowledge, the first zero-shot application of ViFMs for generating multi-modal robot-to-human communication in intermittent supervision contexts, demonstrating both the promise and limitations of these models in human-robot interaction (HRI) scenarios.

Paper Structure

This paper contains 30 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Diagram of the proposed robot summary generation system. The system generates generic or query-driven summaries of long egocentric robot videos in the form of storyboards, short videos, or text to help a user review the robots' autonomous history.
  • Figure 2: Proposed framework for generating generic and query-driven summaries in the form of storyboard, video or text. There are $4$ main steps: preprocess, embed, relevance, and selection. The models and algorithms used for each step can be replaced with similar models and algorithms.
  • Figure 3: (left): Study procedure and (sub-)tasks. (right): User study interface for query-driven storyboard summary. When the user enters a query into the 'User Query' box, selected images are displayed below the box based on the raw video given at bottom left. Questions are given on the right side.
  • Figure 4: Example image (right) from the front egocentric video feed from a fleet of robots (left) deployed for an underground search and rescue mission [kottege2023heterogeneous].
  • Figure 5: Post-hoc analysis. (left): Preference score, (right): Word clouds (token frequency) of user queries.
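
The four steps named in the Figure 2 caption (preprocess, embed, relevance, selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name here is hypothetical, frames and queries are stood in for by small numeric vectors, `embed` is a placeholder for a ViFM encoder, and the generic-summary score (distance from the mean embedding) is an assumed stand-in for whatever relevance measure the paper actually uses.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]


def preprocess(frames: Sequence[Vector], stride: int = 1) -> List[int]:
    """Step 1 (assumed): keep every `stride`-th frame index to reduce redundancy."""
    return list(range(0, len(frames), stride))


def embed(item: Vector) -> Vector:
    """Step 2 (placeholder): L2-normalise the vector.
    A real system would call a video foundation model encoder here."""
    norm = math.sqrt(sum(x * x for x in item)) or 1.0
    return [x / norm for x in item]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def relevance(embs: List[Vector], query_emb: Optional[Vector] = None) -> List[float]:
    """Step 3: score each embedded frame.
    Query-driven: cosine similarity to the query embedding.
    Generic (assumption): 1 - cosine similarity to the mean embedding."""
    if query_emb is not None:
        return [dot(e, query_emb) for e in embs]
    dim = len(embs[0])
    centroid = embed([sum(e[d] for e in embs) / len(embs) for d in range(dim)])
    return [1.0 - dot(e, centroid) for e in embs]


def select(indices: List[int], scores: List[float], k: int) -> List[int]:
    """Step 4: keep the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(indices)), key=lambda j: -scores[j])[:k]
    return sorted(indices[j] for j in ranked)


def summarize(frames: Sequence[Vector], k: int,
              query: Optional[Vector] = None, stride: int = 1) -> List[int]:
    """Run the four steps and return the indices of the selected frames."""
    kept = preprocess(frames, stride)
    embs = [embed(frames[i]) for i in kept]
    query_emb = embed(query) if query is not None else None
    return select(kept, relevance(embs, query_emb), k)


# Toy data: six "frames" as 2-D feature vectors (purely illustrative).
frames = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]
query_driven = summarize(frames, k=2, query=[0, 1])
generic = summarize(frames, k=2)
```

The selected indices could then be rendered as a storyboard (the frames themselves), stitched into a short video, or captioned into text. As the caption notes, each step can be swapped for a similar model or algorithm, which is why the sketch keeps the steps as separate functions.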