VLM-Vac: Enhancing Smart Vacuums through VLM Knowledge Distillation and Language-Guided Experience Replay

Reihaneh Mirjalili, Michael Krawez, Florian Walter, Wolfram Burgard

TL;DR

VLM-Vac tackles open-world perception for domestic robots by distilling knowledge from a Vision-Language Model (VLM) into a compact detector (YOLOv8n) and pairing it with language-guided experience replay to support continual learning in dynamic home environments. The approach has two parts: (1) knowledge distillation (KD) from a VLM (GPT-4o) to a smaller model via an experience pool, and (2) language-based clustering of experiences to form a balanced replay buffer that mitigates forgetting. Experimental results on a TurtleBot 4 Pro dataset show that language-guided clustering yields higher class purity than vision-based methods, and that the KD-plus-replay framework achieves F1 scores close to those of cumulative training while reducing energy consumption by roughly 53% and lowering VLM query frequency over time. These findings indicate practical gains in efficiency and robustness for autonomous vacuum cleaners operating in diverse home settings, with potential for broader open-world robotics applications.
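The balanced replay buffer in part (2) can be illustrated with a minimal sketch. It assumes each stored experience already carries a cluster label (for example from clustering embeddings of the VLM's text descriptions) and fills the buffer round-robin across clusters. The function name `build_balanced_replay`, the round-robin rule, and the `buffer_size` parameter are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict


def build_balanced_replay(cluster_ids, buffer_size, seed=0):
    """Select experience indices so clusters are represented as evenly as possible.

    cluster_ids[i] is the (hypothetical) cluster label of experience i.
    Returns a list of at most buffer_size distinct indices.
    """
    rng = random.Random(seed)

    # Group experience indices by their cluster label.
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    # Shuffle within each cluster so the picks are random but reproducible.
    pools = [rng.sample(members, len(members)) for members in by_cluster.values()]

    selected = []
    # Take one item per cluster in turn until the buffer is full
    # or every cluster has been exhausted.
    while len(selected) < buffer_size and any(pools):
        for pool in pools:
            if pool and len(selected) < buffer_size:
                selected.append(pool.pop())
    return selected


if __name__ == "__main__":
    # Ten experiences in cluster 0, two in cluster 1, three in cluster 2.
    labels = [0] * 10 + [1] * 2 + [2] * 3
    print(build_balanced_replay(labels, buffer_size=6))
```

With round-robin selection, a large cluster cannot crowd out a small one: each cluster contributes one sample per pass until it runs out, which is one simple way to keep rarely seen backgrounds in the replay set.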

Abstract

In this paper, we propose VLM-Vac, a novel framework designed to enhance the autonomy of smart robot vacuum cleaners. Our approach integrates the zero-shot object detection capabilities of a Vision-Language Model (VLM) with a Knowledge Distillation (KD) strategy. By leveraging the VLM, the robot can categorize objects into actionable classes -- either to avoid or to suck -- across diverse backgrounds. However, frequently querying the VLM is computationally expensive and impractical for real-world deployment. To address this issue, we implement a KD process that gradually transfers the essential knowledge of the VLM to a smaller, more efficient model. Our real-world experiments demonstrate that this smaller model progressively learns from the VLM and requires significantly fewer queries over time. Additionally, we tackle the challenge of continual learning in dynamic home environments by exploiting a novel experience replay method based on language-guided sampling. Our results show that this approach is not only energy-efficient but also surpasses conventional vision-based clustering methods, particularly in detecting small objects across diverse backgrounds.
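The query-reduction idea in the abstract can be sketched as a gate in front of the VLM. The sketch below assumes that the smaller model exposes a single familiarity score per frame and that a fixed threshold decides when to fall back to the VLM. The score, the threshold value, and the names `handle_frame`, `query_teacher`, and `Experience` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]


@dataclass
class Experience:
    image_id: str
    description: str   # text description returned by the teacher (VLM)
    boxes: List[Box]   # bounding boxes associated with the image


def handle_frame(
    image_id: str,
    student_confidence: float,
    query_teacher: Callable[[str], Tuple[str, List[Box]]],
    pool: List[Experience],
    threshold: float = 0.5,  # assumed value, chosen only for illustration
) -> bool:
    """Return True if the teacher was queried for this frame."""
    if student_confidence >= threshold:
        # The student is trusted on this frame, so no expensive query is made.
        return False
    # Unfamiliar frame: ask the teacher and store the result for later training.
    description, boxes = query_teacher(image_id)
    pool.append(Experience(image_id, description, boxes))
    return True


if __name__ == "__main__":
    def fake_teacher(image_id: str):
        # Stand-in for a real VLM call; returns a canned answer.
        return "a sock on a carpet", [(0.1, 0.2, 0.3, 0.4)]

    pool: List[Experience] = []
    print(handle_frame("frame_001", 0.9, fake_teacher, pool))  # student is confident
    print(handle_frame("frame_002", 0.2, fake_teacher, pool))  # falls back to teacher
    print(len(pool))
```

As the student is retrained on the pooled experiences, more frames should clear the threshold, which is the mechanism by which the number of VLM queries would fall over time under these assumptions.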

Paper Structure

This paper contains 10 sections, 4 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overview of our smart robotic vacuum cleaner system. The TurtleBot 4 platform, shown on the left, is used in our experiments and resembles a robotic vacuum cleaner. The images on the right, captured by the robot's camera, illustrate our system's real-time detection of "suck" or "avoid" actions.
  • Figure 2: VLM-Vac in a nutshell: We distill relevant knowledge from a Vision-Language Model (VLM) into a compact action-based object detector. The smaller model queries the VLM whenever it encounters an unfamiliar object or background. The new image, its text description from the VLM, and corresponding bounding boxes from the open vocabulary object detector are stored in the experience pool and later used for training the smaller model. Over time, the smaller model learns from these interactions, adapting to its specific environment and thus reducing the need for VLM queries.
  • Figure 3: Sample clusters from $k$-means clustering. The top row displays a sample cluster from the vision-based clustering approach, while the bottom row shows two sample clusters from the language-based clustering approach.
  • Figure 4: The $F_1$ score for naive fine-tuning, language-based experience replay, and cumulative learning across $9$ consecutive days, averaged over $10$ runs with different random seeds. Background shadings represent 3-day intervals. The bars indicate standard deviations.
  • Figure 5: Results of action-based object classification with fine-tuned YOLOv8n on days $7$, $8$, and $9$. The model effectively classifies small objects even in complex backgrounds. To improve visualization, we enlarged the bounding boxes.
  • ...and 2 more figures
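
For readers unfamiliar with the metric in Figure 4, the following minimal sketch computes an $F_1$ score from detection counts and summarizes it over several runs as a mean and standard deviation. The counts in the example are placeholders chosen for illustration, not results from the paper, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize_runs(per_run_counts):
    """Return (mean, population standard deviation) of F1 over runs.

    per_run_counts is a list of (tp, fp, fn) tuples, one per run.
    """
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_run_counts]
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    # Placeholder counts for three runs (illustrative only).
    runs = [(8, 2, 2), (9, 1, 3), (7, 3, 1)]
    avg, std = summarize_runs(runs)
    print(f"F1 = {avg:.3f} +/- {std:.3f}")
```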