Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation

Jennifer Grannen, Michelle Pan, Kenneth Llontop, Cherie Ho, Mark Zolotas, Jeannette Bohg, Dorsa Sadigh

TL;DR

The paper tackles foundation model brittleness in unstructured real-world settings by introducing the Robot-Powered Data Flywheel (RPDF), which turns embodied robots into autonomous data collectors. It formalizes an iterative cycle: raw data gathered by a robot running $\mathrm{FM}_{t-1}$ are curated and accumulated as $\mathcal{D}_t = \bigcup_{k=1}^{t} D_k$, which is then used to fine-tune $\mathrm{FM}_t$, enabling continual domain-specific adaptation and domain-adjacent generalization. The Scanford deployment in the East Asia Library demonstrates substantial gains: domain-specific book identification improves from $32.4\%$ to $71.8\%$, and domain-adjacent multilingual OCR improves from $24.8\%$ to $46.6\%$ on English and from $30.8\%$ to $38.0\%$ on Chinese, while saving roughly $18.7$ hours of human labor and collecting data from 2,103 shelves. This work shows a practical, scalable path to continually refining foundation models through embodied data collection in messy real-world environments.
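The iterative cycle described above can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: the function name `run_flywheel` and the `collect`, `curate`, and `finetune` callables are hypothetical placeholders, and the choice to fine-tune each round starting from the previous model (rather than from the original checkpoint) is an assumption that the summary does not settle.

```python
from typing import Callable, List, Sequence, TypeVar

Model = TypeVar("Model")
Sample = TypeVar("Sample")


def run_flywheel(
    fm0: Model,
    num_rounds: int,
    collect: Callable[[Model], Sequence[Sample]],
    curate: Callable[[Sequence[Sample]], Sequence[Sample]],
    finetune: Callable[[Model, Sequence[Sample]], Model],
) -> List[Model]:
    """Return [FM_0, FM_1, ..., FM_T] after T flywheel rounds.

    All three callables are hypothetical stand-ins for the deployment,
    curation, and fine-tuning stages.
    """
    models = [fm0]
    accumulated: List[Sample] = []  # running D_t (duplicates kept in this sketch)
    for _ in range(num_rounds):
        raw = collect(models[-1])        # robot deployed with FM_{t-1}
        accumulated.extend(curate(raw))  # D_t grows by the curated round D_t's new data
        # Assumption: fine-tune starting from the previous model on all of D_t.
        models.append(finetune(models[-1], accumulated))
    return models
```

With toy stand-ins (integers as "models", a parity filter as "curation"), calling `run_flywheel(0, 3, collect, curate, finetune)` returns one model per round plus the initial one, which makes the accumulation of $\mathcal{D}_t$ across rounds easy to inspect.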

Abstract

Foundation models (FMs) have unlocked powerful zero-shot capabilities in vision and language, yet their reliance on internet pretraining data leaves them brittle in unstructured, real-world settings. The messy, real-world data encountered during deployment (e.g. occluded or multilingual text) remains massively underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale, real-world data that enriches FM training with precisely the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that transforms robots from FM consumers into data generators. By deploying robots equipped with FMs in the wild, we enable a virtuous cycle: robots perform useful tasks while collecting real-world data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate this framework with Scanford, a mobile manipulator deployed in the East Asia Library for 2 weeks. Scanford autonomously scans shelves, identifies books using a vision-language model (VLM), and leverages the library catalog to label images without human annotation. This deployment both aids librarians and produces a dataset to finetune the underlying VLM, improving performance on the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford improves VLM performance on book identification from 32.0% to 71.8% and boosts domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and 30.8% to 38.0% (Chinese), while saving ~18.7 hrs of human time. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and unlock new pathways for continually adapting FMs to the messiness of reality. More details are at: https://scanford-robot.github.io
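One way to picture how a catalog can label images without human annotation is to snap each noisy VLM prediction to its closest catalog entry and discard predictions that match nothing. The sketch below is purely illustrative and is not the paper's curation procedure: the function `label_with_catalog`, the use of fuzzy string matching via Python's standard `difflib`, and the `cutoff` threshold are all assumptions.

```python
import difflib
from typing import List, Optional, Tuple


def label_with_catalog(
    predictions: List[str],
    catalog_titles: List[str],
    cutoff: float = 0.8,  # hypothetical similarity threshold
) -> List[Tuple[str, Optional[str]]]:
    """Pair each VLM prediction with its closest catalog title, or None.

    Assumption: string similarity to a catalog entry is a usable proxy
    for a correct label; the paper's actual curation may differ.
    """
    labeled = []
    for pred in predictions:
        # get_close_matches returns the best matches above the cutoff, best first.
        matches = difflib.get_close_matches(pred, catalog_titles, n=1, cutoff=cutoff)
        labeled.append((pred, matches[0] if matches else None))
    return labeled
```

Predictions that map to a catalog title could serve as training labels, while those that map to `None` would be dropped or flagged, so no human annotator is needed in the loop.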

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 1 algorithm.

Figures (3)

  • Figure 2: In-the-Wild Challenges at the East Asia Library: [Top]: We visualize a representative section of library shelves that were scanned for each day of deployment. We note the highly varied setting, with challenges such as varied shelf heights, lengths, backgrounds, and lighting. We especially highlight the exceedingly short shelves on Tuesday of Week 2, which are only three shelves tall rather than the standard height of seven. We hypothesize that this change in height made sensing the shelf positions from the LiDAR data noisier, leading to more human interventions that day (12). [Bottom]: We present challenges encountered in library shelves (damage, occlusions, and fading/aging of multilingual book labels) which necessitate deploying the data flywheel to improve VLM performance.
  • Figure 3: Adaptation Results with Flywheel Data: Fine-tuning a VLM improves both domain-specific performance and domain-adjacent generalization. Fine-tuning Qwen2.5 for book identification increased performance from 32.4% to 71.8% [Left]. Critically, fine-tuning Qwen with the same data also achieves impressive gains for domain-adjacent generalization (multilingual OCR): from 24.8% to 46.6% for English text and from 30.8% to 38.0% for Chinese text [Right]. We hypothesize that Gemini's poor performance at Chinese OCR stems from a smaller share of Chinese text in its pretraining mixture. Qwen, on the other hand, emphasizes Chinese performance during its pretraining (Qwen2.5 Technical Report).
  • Figure 4: Scanford saves librarian time while improving VLM performance. We report the results of Scanford's autonomous, in-the-wild deployment in the East Asia Library. Scanford scanned and labeled 2,103 shelves -- saving a librarian's estimate of 18.7 hours of human time [Left] -- while simultaneously improving VLM performance [Center]. Over 10 days of data collection, only 26 human interventions were needed, each averaging under 5 minutes [Right].
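
The figures reported in the captions above can be put in perspective with some quick arithmetic. The snippet below is a small worked example using only numbers stated in the captions; the variable names are illustrative, and the 5-minute figure is treated as an upper bound because the caption says each intervention averaged "under 5 minutes".

```python
# Numbers taken directly from the Figure 3 and Figure 4 captions.
shelves = 2103
hours_saved = 18.7
interventions = 26
days = 10
max_minutes_each = 5  # "under 5 minutes" on average, so this is an upper bound

# Average human time saved per scanned shelf, in seconds.
seconds_saved_per_shelf = hours_saved * 3600 / shelves

# Average number of human interventions per day of data collection.
interventions_per_day = interventions / days

# Upper bound on total human time spent on interventions, in hours.
max_intervention_hours = interventions * max_minutes_each / 60

# Absolute accuracy gains in percentage points after fine-tuning.
gain_book_id = round(71.8 - 32.4, 1)
gain_ocr_english = round(46.6 - 24.8, 1)
gain_ocr_chinese = round(38.0 - 30.8, 1)
```

Under these numbers the deployment saves roughly 32 seconds of human time per shelf and requires 2.6 interventions per day on average, with total intervention time bounded by a little over two hours against 18.7 hours saved.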