Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks

Amit Parekh; Nikolas Vitsakis; Alessandro Suglia; Ioannis Konstas

TL;DR

This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of multimodal models, considering architectural design, input perturbations across language and vision modalities, and increased task complexity.

Abstract

Evaluating the generalisation capabilities of multimodal models based solely on their performance on out-of-distribution data fails to capture their true robustness. This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models, considering architectural design, input perturbations across language and vision modalities, and increased task complexity. The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes, raising concerns about overfitting to spurious correlations. By employing this evaluation framework on current Transformer-based multimodal models for robotic manipulation tasks, we uncover limitations and suggest future advancements should focus on architectural and training innovations that better integrate multimodal inputs, enhancing a model's generalisation prowess by prioritising sensitivity to input content over incidental correlations.

Paper Structure (68 sections, 1 equation, 7 figures, 29 tables)

Introduction
Related Work
Language-driven Embodied AI
Language in Robotic Manipulation Tasks
Assessing Generalisation and Robustness
Experimental Setup
Evaluation Data
Models
The Evaluation Framework
Substitutivity in Instructions
Baseline
Evaluating on Paraphrases
Training on Paraphrases
Replacing Visual Referents with Descriptors
Perturbations of Instruction Syntax
...and 53 more sections

Figures (7)

Figure 1: Our evaluation framework. Each perturbation affects the instruction or observation inputs, which can be linguistic, visual, or a combination of both. The plausibility of a perturbation relates to a model's expected performance. Sensitivity to unreasonable conditions indicates that a model should not perform the task successfully given the perturbation, while plausible perturbations suggest that it should still perform successfully.
Figure 2: Illustration of language perturbations challenging model sensitivity to language content in multimodal instructions: one replaces the instruction with random characters (increased token length), and the other replaces it with random words (same sequence length).
Figure 3: Difficulty level comparisons to default (first column). Distracting adds visual clutter; Extreme changes parameters, complexity, and affordances; and Extremely Distracting combines both. Top row: T1 ("pick and place into the container"). Bottom row: T15 ("place all objects with the same shape as the container into it"). For illustration purposes, we denote target containers with a green dashed box and target objects with a pink dashed box.
Figure 4: Illustration comparing default and permuted object tokens per observation. In the default ordering (top), tokens in each observation follow the same pattern: the container object first, the target object second, and then any distractor objects. The permuted ordering (bottom) randomises the order differently for each observation in the same sequence.
Figure D.1: In-environment observations seen by the model, showing task performance when using Gobbledygook Words. Instructions given to the model are shown on top of the images, with the images themselves showing different iterations of either success (see 1, 2, and 4) or failure (see 3).
...and 2 more figures
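
The permuted ordering described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each observation is simply a list of object-token labels; the function name `permute_object_tokens` and the string labels are hypothetical stand-ins, not the paper's implementation.

```python
import random


def permute_object_tokens(observations, seed=None):
    """Shuffle the object tokens of each observation independently.

    `observations` is a list of observations, each a list of object tokens.
    The input is left unchanged; a new list of shuffled copies is returned.
    """
    rng = random.Random(seed)
    permuted = []
    for tokens in observations:
        shuffled = list(tokens)  # copy so the default ordering is preserved
        rng.shuffle(shuffled)    # a fresh shuffle is drawn for every observation
        permuted.append(shuffled)
    return permuted


# Default ordering: container first, target second, then distractors.
default = [
    ["container", "target", "distractor_1", "distractor_2"],
    ["container", "target", "distractor_1", "distractor_2"],
]
permuted = permute_object_tokens(default, seed=0)
```

Because every observation is shuffled separately, the same set of objects is still present in each one, but a model can no longer rely on token position to identify the container or the target.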
