MOSAIC: Modular Foundation Models for Assistive and Interactive Cooking
Huaxiaoyue Wang, Kushal Kedia, Juntao Ren, Rahma Abdullah, Atiksh Bhardwaj, Angela Chao, Kelly Y Chen, Nathaniel Chin, Prithwish Dan, Xinyi Fan, Gonzalo Gonzalez-Pumariega, Aditya Kompella, Maximus Adrian Pace, Yash Sharma, Xiangwan Sun, Neha Sunkara, Sanjiban Choudhury
TL;DR
MOSAIC addresses the challenge of coordinating multiple household robots and humans to perform long-horizon cooking tasks with open-vocabulary objects. It introduces a modular stack that couples large foundation models for high-level reasoning with specialized low-level controllers, enabling scalable and interpretable multi-agent collaboration. Key contributions include a behavior-tree–based Interactive Task Planner, real-time human motion forecasting trained on AMASS and CoMaD, and a visuomotor system that combines OWL-ViT, FastSAM, and CLIP for robust perception and manipulation. End-to-end evaluation across 60 trials shows a 68.3% task success rate with 91.6% average subtask completion, and the modular design facilitates error diagnosis and targeted improvements. The work demonstrates the practical viability of modular foundation-model robotics for assistive cooking and highlights open challenges in grounding, generalization, and continual learning.
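The gap between the two reported metrics (task success rate versus average subtask completion) is easier to see with a small worked example. The sketch below is illustrative only. It assumes a trial counts as a success only when every subtask is completed, which the summary above does not state. The `Trial` class, both helper functions, and the four sample trials are hypothetical and are not taken from the MOSAIC evaluation.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One end-to-end trial (hypothetical record layout)."""
    subtasks_completed: int
    subtasks_total: int

    @property
    def succeeded(self) -> bool:
        # Assumption: a trial succeeds only if all of its subtasks are completed.
        return self.subtasks_completed == self.subtasks_total


def task_success_rate(trials: list[Trial]) -> float:
    """Fraction of trials in which every subtask was completed."""
    return sum(t.succeeded for t in trials) / len(trials)


def average_subtask_completion(trials: list[Trial]) -> float:
    """Mean, over trials, of the fraction of subtasks completed."""
    return sum(t.subtasks_completed / t.subtasks_total for t in trials) / len(trials)


# Made-up data: two fully completed trials and two partially completed ones.
trials = [Trial(4, 4), Trial(3, 4), Trial(4, 4), Trial(2, 4)]

print(task_success_rate(trials))           # 0.5
print(average_subtask_completion(trials))  # 0.8125
```

Under this assumption, a trial that fails near the end still contributes most of its subtasks to the second metric, which is one way the average subtask completion can sit well above the task success rate.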
Abstract
We present MOSAIC, a modular architecture for coordinating multiple robots to (a) interact with users using natural language and (b) manipulate an open vocabulary of everyday objects. MOSAIC employs modularity at several levels: it leverages multiple large-scale pre-trained models for high-level tasks like language and image recognition, while using streamlined modules designed for low-level task-specific control. This decomposition allows us to reap the complementary benefits of foundation models as well as precise, more specialized models. Pieced together, our system is able to scale to complex tasks that involve coordinating multiple robots and humans. First, we unit-test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We then extensively evaluate MOSAIC with 60 end-to-end trials. We discuss crucial design decisions, limitations of the current system, and open challenges in this domain. The project's website is at https://portal-cornell.github.io/MOSAIC/
