Leveraging Foundation Models for Enhancing Robot Perception and Action
Reihaneh Mirjalili
TL;DR
This work investigates how foundation models can endow robots with semantic awareness to improve localization, interaction, and manipulation in unstructured environments. It introduces four contributions: FM-Loc for robust vision-based place recognition, Lan-grasp for semantic object grasping, VLM-Vac for efficient autonomous vacuuming via knowledge distillation and language-guided continual learning, and ARRO for robust visuomotor policies through open-vocabulary visual abstraction. Each module leverages large language models, vision-language models, and prompting strategies to achieve zero-shot reasoning, while distillation and language-guided experience replay enable deployment on resource-constrained platforms. Collectively, the results demonstrate improved generalization, data efficiency, and robustness to visual domain shifts, highlighting a path toward scalable, semantics-aware robotic intelligence that can operate in real-world human environments.
Abstract
This thesis investigates how foundation models can be systematically leveraged to enhance robotic capabilities, enabling more effective localization, interaction, and manipulation in unstructured environments. The work is structured around four core lines of inquiry, each addressing a fundamental challenge in robotics while collectively contributing to a cohesive framework for semantics-aware robotic intelligence.
