ROSER: Few-Shot Robotic Sequence Retrieval for Scalable Robot Learning

Zillur Rahman, Eddison Pham, Alejandro Daniel Noel, Cristian Meo

TL;DR

ROSER is a lightweight few-shot retrieval framework that learns task-agnostic metric spaces over temporal windows. It enables accurate retrieval with as few as 3-5 demonstrations and no task-specific training, improving data availability for robot learning.

Abstract

A critical bottleneck in robot learning is the scarcity of task-labeled, segmented training data, despite the abundance of large-scale robotic datasets recorded as long, continuous interaction logs. Existing datasets contain vast amounts of diverse behaviors, yet remain structurally incompatible with modern learning frameworks that require cleanly segmented, task-specific trajectories. We address this data utilization crisis by formalizing robotic sequence retrieval: the task of extracting reusable, task-centric segments from unlabeled logs using only a few reference examples. We introduce ROSER, a lightweight few-shot retrieval framework that learns task-agnostic metric spaces over temporal windows, enabling accurate retrieval with as few as 3-5 demonstrations without requiring any task-specific training. To validate our approach, we establish comprehensive evaluation protocols and benchmark ROSER against classical alignment methods, learned embeddings, and language model baselines across three large-scale datasets (LIBERO, DROID, and nuScenes). Our experiments demonstrate that ROSER consistently outperforms all prior methods in both accuracy and efficiency, achieving sub-millisecond per-match inference while maintaining superior distributional alignment. By reframing data curation as few-shot retrieval, ROSER provides a practical pathway to unlock underutilized robotic datasets, fundamentally improving data availability for robot learning.
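
The retrieval idea summarized above (embed temporal windows, then compare them against a few reference demonstrations) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a hypothetical stand-in for the learned encoder (here just a per-channel mean), Euclidean distance stands in for the learned metric, and the function names, window length, stride, and toy data are all invented for the example.

```python
import math

def embed(window):
    """Hypothetical stand-in for the learned encoder: per-channel mean of a window."""
    n = len(window)
    dims = len(window[0])
    return [sum(step[d] for step in window) / n for d in range(dims)]

def prototype(support_windows):
    """Average the embeddings of the few reference demonstrations."""
    embs = [embed(w) for w in support_windows]
    return [sum(col) / len(embs) for col in zip(*embs)]

def dist(a, b):
    """Euclidean distance; an assumption standing in for the learned metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(log, proto, window_len, stride=1, top_k=1):
    """Slide a window over an unlabeled log; return the top_k (distance, start) pairs."""
    scored = []
    for start in range(0, len(log) - window_len + 1, stride):
        window = log[start:start + window_len]
        scored.append((dist(embed(window), proto), start))
    scored.sort()
    return scored[:top_k]

# Toy data (made up): three 1-channel reference demonstrations of a "high" segment.
support = [
    [[1.0], [1.0], [1.0]],
    [[0.9], [1.1], [1.0]],
    [[1.0], [0.9], [1.1]],
]
# A continuous unlabeled log containing one matching segment.
log = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0], [0.0], [0.0]]

proto = prototype(support)
best = retrieve(log, proto, window_len=3)
print(best)
```

In this toy setup the closest window is the one that covers the three high readings in the log. A real encoder would be trained once on labeled data and then reused across tasks, which is what makes the few-shot setting possible.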

Paper Structure (53 sections, 42 equations, 39 figures, 8 tables, 2 algorithms)

Figures (39)

  • Figure 1: Retrieval Framework. The time-series encoder $f_\theta$ uses the labeled dataset $\mathcal{S}$ to create a metric space in which similar tasks are grouped together to form prototypes, while different tasks are separated from one another. The embeddings of unlabeled trajectories $\mathcal{U}$ are compared with the prototypes using the learned metric $\mathrm{Dist}(\cdot)$ to find the closest match and build the retrieved set $\mathcal{R}$.
  • Figure 2: Feature-level distribution visualization for task "regular stop" in the nuScenes benchmark.
  • Figure 3: Relationship between Wasserstein distance and intra-class distance (diversity). Each point corresponds to a single retrieval model evaluated on the dataset. For each dataset, we report Spearman’s rank correlation ($\rho$), Pearson correlation ($r$), and Kendall’s tau ($\tau$).
  • Figure 4: Feature-level distribution visualization for microwave open task in the LIBERO benchmark.
  • Figure 5: LIBERO qualitative results for microwave open task. ROSER retrieves the correct sample, while Dtaidistance and Stumpy retrieve the drawer close task.
  • ...and 34 more figures
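
The quantities named in the Figure 3 caption (a Wasserstein distance per retrieval model, correlated against a diversity score) can be illustrated with a short sketch. It is written in plain Python and makes several assumptions: the 1-D Wasserstein-1 distance is computed for equal-size samples only, the rank correlation ignores ties, and every number below is made up for illustration rather than taken from the paper. How the paper aggregates distances across features is not stated in this excerpt.

```python
import math

def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D empirical samples: mean gap between sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical feature values: ground-truth segments vs. segments one model retrieved.
reference = [0.0, 1.0, 2.0]
retrieved = [1.0, 2.0, 3.0]
w1 = wasserstein_1d(reference, retrieved)

# Hypothetical per-model scores (one entry per retrieval model).
wasserstein_scores = [0.1, 0.4, 0.2, 0.9]
diversity_scores = [1.0, 3.0, 2.0, 10.0]
rho = spearman(wasserstein_scores, diversity_scores)
r = pearson(wasserstein_scores, diversity_scores)
print(w1, rho, r)
```

Reporting a rank-based coefficient alongside Pearson's is a common choice when the relationship may be monotone but not linear, as in the toy scores above.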