On the Strengths and Weaknesses of Data for Open-set Embodied Assistance

Pradyumna Tambwekar, Andrew Silva, Deepak Gopinath, Jonathan DeCastro, Xiongyi Cui, Guy Rosman

TL;DR

We study which kinds of assistive data enable open-set assistive intelligence, and show that performant models benefit from datasets that cover different aspects of assistance, including multimodal grounding, defect inference, and exposure to diverse scenarios.

Abstract

Embodied foundation models are increasingly performant in real-world domains such as robotics and autonomous driving. These models are often deployed in interactive or assistive settings, where they must generalize to new users and new tasks. Diverse interactive data generation offers a promising avenue for providing data-efficient generalization capabilities for interactive embodied foundation models. In this paper, we investigate the generalization capabilities of a multimodal foundation model fine-tuned on diverse interactive assistance data in a synthetic domain. We explore generalization along two axes: a) assistance with unseen categories of user behavior and b) providing guidance in new configurations not encountered during training. We study a broad capability called Open-Set Corrective Assistance, in which the model needs to inspect lengthy user behavior and provide assistance through either corrective actions or language-based feedback. This task remains unsolved in prior work, which typically assumes closed corrective categories or relies on external planners, making it a challenging testbed for evaluating the limits of assistive data. To support this task, we generate synthetic assistive datasets in Overcooked and fine-tune a LLaMA-based model to evaluate generalization to novel tasks and user behaviors. Our approach provides key insights into the nature of assistive datasets required to enable open-set assistive intelligence. In particular, we show that performant models benefit from datasets that cover different aspects of assistance, including multimodal grounding, defect inference, and exposure to diverse scenarios.

Paper Structure (32 sections, 19 figures, 5 tables)

Figures (19)

  • Figure 1: We simulate synthetic users in Overcooked to generate multimodal (image + text) gameplay trajectories, which are used to distill complementary synthetic datasets. These datasets are designed to (1) ground actions to environmental outcomes and (2) support behavior understanding and correction from trajectories. By training an embodied model on these data, we evaluate whether an embodied foundation model can generalize to unseen defective behaviors and novel task configurations.
  • Figure 2: This figure provides an abstract depiction of each task synthesized in this paper to train our assistive model. Top to bottom: Trajectory-QA, Video-QA, Image-QA, Corrections, Coaching, Defect Delineation.
  • Figure 3: Dataset Example
  • Figure 4: Model Output
  • Figure 5: Prompt used to have GPT-4o generate Overcooked map configurations.
  • ...and 14 more figures