MOSAIC: Modular Foundation Models for Assistive and Interactive Cooking

Huaxiaoyue Wang, Kushal Kedia, Juntao Ren, Rahma Abdullah, Atiksh Bhardwaj, Angela Chao, Kelly Y Chen, Nathaniel Chin, Prithwish Dan, Xinyi Fan, Gonzalo Gonzalez-Pumariega, Aditya Kompella, Maximus Adrian Pace, Yash Sharma, Xiangwan Sun, Neha Sunkara, Sanjiban Choudhury

TL;DR

MOSAIC addresses the challenge of coordinating multiple household robots and humans to perform long-horizon cooking tasks with open-vocabulary objects. It introduces a modular stack that couples large foundation models for high-level reasoning with specialized low-level controllers, enabling scalable and interpretable multi-agent collaboration. Key contributions include a behavior-tree–based Interactive Task Planner, real-time human motion forecasting trained on AMASS and CoMaD, and a visuomotor system combining OWL-ViT, FastSAM, and CLIP for robust perception and manipulation. End-to-end evaluation across 60 trials shows a 68.3% task success rate with 91.6% average subtask completion, and the modular design facilitates error diagnosis and targeted improvements. The work demonstrates the practical viability of modular foundation-model robotics for assistive cooking and highlights open challenges in grounding, generalization, and continual learning.

Abstract

We present MOSAIC, a modular architecture for coordinating multiple robots to (a) interact with users using natural language and (b) manipulate an open vocabulary of everyday objects. MOSAIC employs modularity at several levels: it leverages multiple large-scale pre-trained models for high-level tasks like language and image recognition, while using streamlined modules designed for low-level task-specific control. This decomposition allows us to reap the complementary benefits of foundation models as well as precise, more specialized models. Pieced together, our system is able to scale to complex tasks that involve coordinating multiple robots and humans. First, we unit-test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We then extensively evaluate MOSAIC with 60 end-to-end trials. We discuss crucial design decisions, limitations of the current system, and open challenges in this domain. The project's website is at https://portal-cornell.github.io/MOSAIC/
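The decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: every name here (`Subtask`, `Action`, `execute`, `locate`, `forecast_human`, `safe_distance`) is hypothetical, and the rule that the robot waits when the forecast human position comes close to the grasp pose is an assumption made purely for illustration. The sketch only shows how independently developed modules might be composed behind plain function interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical 3D position (x, y, z) in metres.
Pose = Tuple[float, float, float]


@dataclass
class Subtask:
    robot: str   # e.g. "R1" (tabletop) or "R2" (mobile)
    skill: str   # e.g. "pick"
    target: str  # open-vocabulary object description


@dataclass
class Action:
    robot: str
    skill: str
    grasp: Pose
    wait_for_human: bool


def execute(
    subtask: Subtask,
    locate: Callable[[str], Pose],
    forecast_human: Callable[[], Sequence[Pose]],
    safe_distance: float = 0.3,
) -> Action:
    """Combine a planner subtask, a grasp pose, and a motion forecast.

    `locate` stands in for the vision module and `forecast_human` for the
    forecasting module; both are injected so each can be tested alone.
    """
    grasp = locate(subtask.target)
    future = forecast_human()
    # Assumed rule: hold the action if any forecast human position is
    # closer to the grasp pose than `safe_distance`.
    too_close = any(
        sum((a - b) ** 2 for a, b in zip(p, grasp)) ** 0.5 < safe_distance
        for p in future
    )
    return Action(subtask.robot, subtask.skill, grasp, wait_for_human=too_close)
```

Because each module is passed in as a function, a stub can replace it in a unit test, which mirrors the per-module testing the abstract describes.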

Paper Structure

This paper contains 26 sections, 1 equation, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: MOSAIC cooking in the kitchen. MOSAIC interacts with a user via natural language and controls a tabletop manipulator (R1) and a mobile manipulator (R2) to prepare vegetable soup with the user.
  • Figure 2: MOSAIC System Overview. The Interactive Task Planner module communicates with the user via natural language to decide on a recipe and assigns subtasks to each robot accordingly. The Human Motion Forecasting module extracts the human's 2D pose and converts it to 3D coordinates, which it uses to predict future human motion. Separately, a VLM takes image and language as input and produces a 3D grasp pose around the object of interest. The execution policy of the Visuomotor Skill module combines all three outputs to produce the final robot action.
  • Figure 3: End-to-end results. On-policy results for 6 recipes, where each recipe is tested through 10 trials. Each recipe contains various subtasks involving different robot skills. We report the number of trials that are completed without any errors and the individual subtask completion rate. We also categorize the failure cases. MOSAIC is able to complete 41/60 tasks with an average subtask completion rate of 91.6%.
  • Figure 4: Task Planner Constraint Violations During Real User Interactions. We receive 46 responses in total (26 from the internal study and 20 from the external study). Each user is assigned either Tree or One-Prompt. We present the total number of constraint violations per category. Tree makes 62.8% fewer constraint violations than One-Prompt for the combined responses, 36.2% fewer for internal, and 62.2% fewer for external.
  • Figure 5: Vision backbone example failure cases. We find that a cluttered background and poor lighting conditions lead to a suboptimal set of bounding boxes for CLIP to score, while more specific prompts produce better bounding box proposals.
  • ...and 6 more figures
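
The end-to-end figures quoted in the Figure 3 caption can be cross-checked with simple arithmetic. The sketch below uses only the counts stated there (6 recipes, 10 trials each, 41 trials completed without errors); the helper name `percent` is illustrative. The 91.6% average subtask completion rate cannot be recomputed this way, because the caption does not give per-recipe subtask counts.

```python
def percent(successes: int, trials: int) -> float:
    """Return successes / trials as a percentage rounded to one decimal."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return round(100.0 * successes / trials, 1)


recipes = 6
trials_per_recipe = 10
total_trials = recipes * trials_per_recipe  # 60 end-to-end trials
completed = 41                              # trials completed without errors

print(total_trials)                      # 60
print(percent(completed, total_trials))  # 68.3
```

The result agrees with the 68.3% task success rate reported in the TL;DR.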