Context-Aware Command Understanding for Tabletop Scenarios

Paul Gajewski; Antonio Galiza Cerdeira Gonzalez; Bipin Indurkhya

TL;DR

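This paper presents a hybrid, zero-shot algorithm that combines speech, gestures, and scene context to interpret natural human commands in tabletop scenarios. The authors discuss its strengths and limitations, focusing on how it handles multimodal command interpretation and how it can be integrated into symbolic robotic frameworks for safe and explainable decision-making.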

Abstract

This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios. By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot, identifying relevant objects and actions. The system operates in a zero-shot fashion, without reliance on predefined object models, enabling flexible and adaptive use in various environments. We assess the integration of multiple deep learning models, evaluating their suitability for deployment in real-world robotic setups. Our algorithm performs robustly across different tasks, combining language processing with visual grounding. In addition, we release a small dataset of video recordings used to evaluate the system. This dataset captures real-world interactions in which a human gives a robot instructions in natural language, and is intended as a contribution to future research on human-robot interaction. We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation and on how it can be integrated into symbolic robotic frameworks for safe and explainable decision-making.

Paper Structure

This paper contains 15 sections, 2 figures, and 3 tables.

Introduction
Related Work
Methodology
Input processing
Audio transcription
Textual command understanding
Video processing
Pointing gesture understanding
Object handling
Target handling
Compiling the command representation
Applying the command representation
Evaluation
Results
Conclusions & Future Work

Figures (2)

Figure 1: Examples of automatically generated annotations created from videos. The text at the top lists the extracted elements of the command in the format: object - action - target. All detected objects relevant to the command are highlighted on the table. The selected object is highlighted in orange and segmented out, with a contour marked around it. A green arrow represents the detected pointing vector. In the top example, the command was: “Take this plate and stack it on top of the other plate.” For the lower example, the command was: “Take the banana and put it inside the frying pan.”
Figure 2: Overview of the Information Flow and General Architecture of the Algorithm. Square-cornered boxes denote procedural decision-making components, while round-cornered boxes represent deep learning models.
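
The "object - action - target" format and the pointing vector mentioned in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `DetectedObject`, `Command`, `distance_to_ray`, and `select_by_pointing` are hypothetical, the coordinates are assumed to be 2D image-plane pixels, and picking the candidate closest to the pointing ray is an assumed heuristic for resolving a phrase such as "this plate".

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class DetectedObject:
    label: str     # hypothetical detector label, e.g. "plate"
    center: Point  # assumed 2D image-plane centre, in pixels


@dataclass(frozen=True)
class Command:
    obj: str
    action: str
    target: Optional[str] = None

    def __str__(self) -> str:
        # Mirrors the "object - action - target" format of the Figure 1 caption.
        return f"{self.obj} - {self.action} - {self.target or 'none'}"


def distance_to_ray(point: Point, origin: Point, direction: Point) -> float:
    """Distance from a point to the half-line that starts at origin."""
    dx, dy = direction
    norm = hypot(dx, dy)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    ux, uy = dx / norm, dy / norm
    px, py = point[0] - origin[0], point[1] - origin[1]
    along = px * ux + py * uy
    if along <= 0:
        # The point lies behind the ray origin: use the distance to the origin.
        return hypot(px, py)
    # Perpendicular distance to the ray (magnitude of the 2D cross product).
    return abs(px * uy - py * ux)


def select_by_pointing(
    candidates: Sequence[DetectedObject], origin: Point, direction: Point
) -> DetectedObject:
    """Assumed heuristic: choose the candidate closest to the pointing ray."""
    if not candidates:
        raise ValueError("no candidate objects")
    return min(
        candidates,
        key=lambda o: distance_to_ray(o.center, origin, direction),
    )
```

With two detected plates and a pointing ray that passes through the first one, `select_by_pointing` returns that plate, and the result can then be stored in a `Command` whose string form follows the caption's format. The field values used for such a command are illustrative only; the exact strings produced by the system are not given here.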
