SpawnNet: Learning Generalizable Visuomotor Skills from Pre-trained Networks

Xingyu Lin, John So, Sashwat Mahalingam, Fangchen Liu, Pieter Abbeel

TL;DR

SpawnNet tackles generalization in visuomotor skills by leveraging pre-trained vision representations through a two-stream architecture that fuses multi-layer ViT features with a learnable perception stream via adapters. The approach alleviates the bottleneck of frozen backbones, enabling robust policy learning for diverse objects in both simulation and real-world tasks. Across Open Door/Open Drawer (simulation) and three real-world manipulation tasks, SpawnNet consistently outperforms frozen and from-scratch baselines, with notable gains from dense spatial features and depth augmentation. The work demonstrates that adaptive fusion of pre-trained features, not mere freezing, yields stronger cross-instance generalization and offers a practical path for deploying generalizable robotic manipulation policies.

Abstract

The existing internet-scale image and video datasets cover a wide range of everyday objects and tasks, bringing the potential of learning policies that generalize in diverse scenarios. Prior works have explored visual pre-training with different self-supervised objectives. Still, the generalization capabilities of the learned policies and the advantages over well-tuned baselines remain unclear from prior studies. In this work, we present a focused study of the generalization capabilities of the pre-trained visual representations at the categorical level. We identify the key bottleneck in using a frozen pre-trained visual backbone for policy learning and then propose SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy. Through extensive simulated and real experiments, we show significantly better categorical generalization compared to prior approaches in imitation learning settings. Open-sourced code and videos can be found on our website: https://xingyu-lin.github.io/spawnnet.
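The fusion idea described above can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the adapter form (a 1x1 projection followed by ReLU), the channel-wise concatenation, the layer count, the feature sizes, and the random untrained weights are all assumptions chosen only to show how features from several layers of a frozen backbone might be injected into a separate learnable stream.

```python
import numpy as np


def conv1x1_relu(feat, weight):
    """Apply a 1x1 convolution (per-location linear map) followed by ReLU.

    feat:   (C_in, H, W) feature map
    weight: (C_out, C_in) projection matrix
    """
    projected = np.einsum("oc,chw->ohw", weight, feat)
    return np.maximum(projected, 0.0)


def fuse(stream_feat, pretrained_feat, w_adapter, w_mix):
    """Fuse one frozen pre-trained feature map into the learnable stream.

    Assumed design: adapt the pre-trained features, concatenate them with
    the stream features along channels, then mix back to the stream width.
    """
    adapted = conv1x1_relu(pretrained_feat, w_adapter)
    combined = np.concatenate([stream_feat, adapted], axis=0)
    return conv1x1_relu(combined, w_mix)


rng = np.random.default_rng(0)

# Hypothetical sizes: 3 tapped backbone layers, 384-dim patch features
# on a 14x14 grid, and a 16-channel learnable stream.
num_layers, c_pre, c_stream, grid = 3, 384, 16, 14
pretrained_layers = [
    rng.standard_normal((c_pre, grid, grid)) for _ in range(num_layers)
]

stream = rng.standard_normal((c_stream, grid, grid))
for layer_feat in pretrained_layers:
    # Random weights stand in for parameters that would be learned.
    w_adapter = rng.standard_normal((c_stream, c_pre)) * 0.01
    w_mix = rng.standard_normal((c_stream, 2 * c_stream)) * 0.1
    stream = fuse(stream, layer_feat, w_adapter, w_mix)

print(stream.shape)
```

In a real policy the adapter and mixing weights would be trained with the imitation objective while the backbone stays frozen, and the feature maps would likely need resizing to match the stream's resolution at each stage.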

Paper Structure

This paper contains 24 sections, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Prior approaches for learning policies (a) from scratch, (b) from a pre-trained visual representation with a frozen backbone, and (c) the proposed two-stream architecture. The right figure (d) shows their performances on a real-world imitation learning task, evaluated on both seen and unseen instances in a category.
  • Figure 2: We consider three challenging categorical manipulation tasks in the real world. For each task, we train on three instances (green boxes) and test on held-out instances (red boxes), with additional variations in poses, articulation, visual distraction, and deformation.
  • Figure 3: Adapter layers to fuse the pre-trained features with the learnable features.
  • Figure 4: Simulation results on Open Door and Open Drawers. The left figures show the training and novel instances we use. The observations are rendered from the agent's middle camera. We add red spheres in the scene to specify the task of which door/drawer to open. The right shows success rates of different methods in both seen and unseen instances after a fixed number of agent rollouts. The dashed black line shows the RL expert's performance. The error bars show the standard error computed from three random seeds. Numbers in the brackets denote the number of instances.
  • Figure 5: Real-world manipulation results on three tasks. All methods here use the same data augmentation. We evaluate each method on each instance over 5 trials (more than 30 trials on novel instances). We report the mean and standard error.
  • ...and 5 more figures
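
The captions for Figures 4 and 5 report a mean with standard-error bars. As a small illustration of that statistic, the sketch below computes both from per-seed success rates. The numbers are made up for the example, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the captions do not say which convention the authors use.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for a list of numbers.

    Uses the sample standard deviation (n - 1 denominator), which is an
    assumed convention here.
    """
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical success rates from three random seeds (not from the paper).
success_rates = [0.60, 0.70, 0.80]
mean, sem = mean_and_standard_error(success_rates)
print(f"mean={mean:.3f}, standard error={sem:.3f}")
```

With only three seeds the standard error is itself a noisy estimate, so small differences between methods are best read with that in mind.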