
Leveraging Foundation Models for Enhancing Robot Perception and Action

Reihaneh Mirjalili

TL;DR

This work investigates how foundation models can endow robots with semantic awareness to improve localization, interaction, and manipulation in unstructured environments. It introduces four contributions: FM-Loc for robust vision-based place recognition, Lan-grasp for semantic object grasping, VLM-Vac for efficient autonomous vacuuming via knowledge distillation and language-guided continual learning, and ARRO for robust visuomotor policies through open-vocabulary visual abstraction. Each module leverages large language models, vision-language models, and prompting strategies to achieve zero-shot reasoning, while distillation and language-guided experience replay enable deployment on resource-constrained platforms. Collectively, the results demonstrate improved generalization, data efficiency, and robustness to visual domain shifts, highlighting a path toward scalable, semantics-aware robotic intelligence that can operate in real-world human environments.

Abstract

This thesis investigates how foundation models can be systematically leveraged to enhance robotic capabilities, enabling more effective localization, interaction, and manipulation in unstructured environments. The work is structured around four core lines of inquiry, each addressing a fundamental challenge in robotics while collectively contributing to a cohesive framework for semantics-aware robotic intelligence.

Paper Structure

This paper contains 59 sections, 14 equations, 30 figures, 10 tables, 2 algorithms.

Figures (30)

  • Figure 1: Our localization method matches query images (top) to a set of reference images (bottom). Despite considerable differences in viewpoint and object placement between reference and query image sets, our approach correctly recognizes the locations.
  • Figure 2: FM-Loc in a nutshell: We use a Visual-Language Model (VLM) and a Large Language Model (LLM) to achieve robust localization under severe viewpoint and scene changes. For each query image, the VLM detects objects and ranks them by grounding score. The top-k object labels form a prompt for the LLM, which predicts potential room categories. These room labels are re-ranked by the VLM to identify the most likely location. We use this semantic information to compare each query image with reference images and select the one with the highest similarity score.
  • Figure 3: The reference (red) and query (blue) trajectories in dataset 1. The environment consists of four rooms, namely a kitchen (yellow), a hallway (green), a conference room (blue), and an office (red). The images depict these rooms for query and reference trajectories.
  • Figure 4: Comparison of query and retrieved images for different methods.
  • Figure 5: Robot executing the command "Pick up the ice cream please". The grasp on the left is generated without semantic information, while the grasp on the right is produced by our method, which leverages the deeper understanding of the task and the object provided by Large Language Models.
  • ...and 25 more figures
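
The pipeline summarized in the Figure 2 caption (rank detected objects by grounding score, prompt an LLM with the top-k labels, re-rank the proposed room labels with the VLM, then compare query and reference images) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the VLM and LLM are passed in as plain callables (`propose_rooms`, `rerank_rooms`), the prompt wording is invented, and the similarity score (room match plus Jaccard overlap of object labels, mixed by `room_weight`) is an assumption, since the caption does not specify how similarity is computed. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Detection:
    label: str
    score: float  # grounding score reported by the VLM (assumed: higher is better)


@dataclass(frozen=True)
class SemanticDescriptor:
    room: str
    objects: Tuple[str, ...]


def top_k_labels(detections: Sequence[Detection], k: int) -> List[str]:
    """Return up to k distinct object labels, highest grounding score first."""
    labels: List[str] = []
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if det.label not in labels:
            labels.append(det.label)
        if len(labels) == k:
            break
    return labels


def build_room_prompt(labels: Sequence[str]) -> str:
    """Hypothetical prompt wording; the actual prompt is not given in the source."""
    return "Objects seen: " + ", ".join(labels) + ". Which room types could this be?"


def describe(
    detections: Sequence[Detection],
    propose_rooms: Callable[[str], List[str]],       # stands in for the LLM
    rerank_rooms: Callable[[List[str]], List[str]],  # stands in for the VLM re-ranking
    k: int = 3,
) -> SemanticDescriptor:
    """Build a semantic descriptor (best room label + top-k objects) for one image."""
    labels = top_k_labels(detections, k)
    candidates = propose_rooms(build_room_prompt(labels))
    best_room = rerank_rooms(candidates)[0]
    return SemanticDescriptor(room=best_room, objects=tuple(labels))


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Set overlap of two label collections; 0.0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def similarity(q: SemanticDescriptor, r: SemanticDescriptor, room_weight: float = 0.5) -> float:
    """Assumed score: weighted mix of room agreement and object-label overlap."""
    room_match = 1.0 if q.room == r.room else 0.0
    return room_weight * room_match + (1.0 - room_weight) * jaccard(q.objects, r.objects)


def localize(query: SemanticDescriptor, references: Sequence[SemanticDescriptor]) -> int:
    """Return the index of the reference descriptor most similar to the query."""
    return max(range(len(references)), key=lambda i: similarity(query, references[i]))
```

In practice, `propose_rooms` would wrap a call to a language model and `rerank_rooms` would score the candidate room labels against the image; here they are left as injected functions so the control flow can be read and tested without any model.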