Robot Fleet Learning via Policy Merging

Lirui Wang, Kaiqing Zhang, Allan Zhou, Max Simchowitz, Russ Tedrake

TL;DR

The paper addresses fleet-level policy learning under limited bandwidth by proposing policy merging (PoMe) and introducing Fleet-Merge, an algorithm that aligns multiple recurrent policies to a common reference using soft permutation projections. It demonstrates that merging $N$ locally trained policies with non-iid data can yield a single, effective policy $\theta_{\mathrm{mrg}}$ without sharing training data, outperforming naive averaging and matching centralized training in many cases. The method is validated on 50 Meta-World tasks and a new Drake-based FLEET-TOOLS benchmark, showing strong test-time performance, flat mode connectivity, and robustness to decentralization. This work advances scalable, data-efficient fleet learning for robotics by enabling diverse skill consolidation with minimal communication and without centralized data collection.

Abstract

Fleets of robots ingest massive amounts of heterogeneous streaming data silos generated by interacting with their environments, far more than what can be stored or transmitted with ease. At the same time, teams of robots should co-acquire diverse skills through their heterogeneous experiences in varied settings. How can we enable such fleet-level learning without having to transmit or centralize fleet-scale data? In this paper, we investigate policy merging (PoMe) from such distributed heterogeneous datasets as a potential solution. To efficiently merge policies in the fleet setting, we propose FLEET-MERGE, an instantiation of distributed learning that accounts for the permutation invariance that arises when parameterizing the control policies with recurrent neural networks. We show that FLEET-MERGE consolidates the behavior of policies trained on 50 tasks in the Meta-World environment, with good performance on nearly all training tasks at test time. Moreover, we introduce a novel robotic tool-use benchmark, FLEET-TOOLS, for fleet policy learning in compositional and contact-rich robot manipulation tasks, to validate the efficacy of FLEET-MERGE on the benchmark.
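
To illustrate why accounting for permutation invariance matters when averaging weights, here is a minimal sketch, not the paper's method. It assumes a one-hidden-layer feedforward policy rather than a recurrent one, and it swaps in a brute-force search over hard permutations in place of the soft permutation projections used by FLEET-MERGE. All names (`mlp`, `permute_hidden`, `align_to`, `max_error`) are hypothetical.

```python
import itertools
import numpy as np

def mlp(params, x):
    """One-hidden-layer policy (assumed form): tanh hidden layer, linear output."""
    W1, b1, W2 = params
    return W2 @ np.tanh(W1 @ x + b1)

def permute_hidden(params, perm):
    """Relabel hidden units: permute rows of W1 and b1, columns of W2."""
    W1, b1, W2 = params
    return (W1[perm], b1[perm], W2[:, perm])

def average(p, q):
    """Element-wise average of two parameter tuples."""
    return tuple(0.5 * (a + b) for a, b in zip(p, q))

def align_to(reference, params):
    """Exhaustively pick the hidden-unit permutation closest to the reference."""
    n_hidden = params[1].shape[0]

    def dist(perm):
        candidate = permute_hidden(params, list(perm))
        return sum(np.sum((a - b) ** 2) for a, b in zip(reference, candidate))

    best = min(itertools.permutations(range(n_hidden)), key=dist)
    return permute_hidden(params, list(best))

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 4, 2
reference = (
    rng.normal(size=(d_h, d_in)),
    rng.normal(size=d_h),
    rng.normal(size=(d_out, d_h)),
)
# A functionally identical policy whose hidden units are stored in another order.
other = permute_hidden(reference, [1, 2, 3, 0])
xs = rng.normal(size=(16, d_in))

def max_error(params):
    """Largest output deviation from the reference policy over the test inputs."""
    return max(np.max(np.abs(mlp(params, x) - mlp(reference, x))) for x in xs)

naive_err = max_error(average(reference, other))
aligned_err = max_error(average(reference, align_to(reference, other)))
print("naive averaging error:  ", naive_err)
print("aligned averaging error:", aligned_err)
```

Here the two policies compute the same function, so averaging after alignment reproduces the reference, while averaging the raw weights does not. The exhaustive search is only feasible for a handful of hidden units; it stands in for the scalable alignment step that a practical method needs.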

Paper Structure (16 sections, 1 theorem, 9 equations, 5 figures, 1 algorithm)

This paper contains 16 sections, 1 theorem, 9 equations, 5 figures, 1 algorithm.

Key Result

Proposition 4.1

Any recurrent neural network of the form given by the paper's RNN equation (eq:rnn) is invariant under any transformation $\mathcal{P} \in \mathcal{G}_{\mathrm{perm}}$.
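
The invariance can be checked numerically. The sketch below is illustrative only: the referenced RNN equation is not reproduced on this page, so it assumes a standard Elman RNN, $h_{t} = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ with output $y_t = W_{hy} h_t$ and a zero initial state. The function names are hypothetical.

```python
import numpy as np

def rnn_forward(params, xs):
    """Run an Elman RNN (assumed form) and return the output at every step."""
    W_xh, W_hh, b_h, W_hy = params
    h = np.zeros(W_hh.shape[0])  # zero initial state is unchanged by relabeling
    ys = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        ys.append(W_hy @ h)
    return np.stack(ys)

def permute_hidden(params, perm):
    """Relabel hidden units consistently across all weight matrices."""
    W_xh, W_hh, b_h, W_hy = params
    return (
        W_xh[perm],            # rows produce hidden units
        W_hh[perm][:, perm],   # rows produce, columns consume hidden units
        b_h[perm],
        W_hy[:, perm],         # columns consume hidden units
    )

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 2, 7
params = (
    rng.normal(size=(d_h, d_in)),
    rng.normal(size=(d_h, d_h)),
    rng.normal(size=d_h),
    rng.normal(size=(d_out, d_h)),
)
xs = rng.normal(size=(T, d_in))
perm = rng.permutation(d_h)

y_ref = rnn_forward(params, xs)
y_perm = rnn_forward(permute_hidden(params, perm), xs)
max_gap = np.max(np.abs(y_ref - y_perm))
print("max output difference:", max_gap)
```

The two output sequences agree up to floating-point rounding, even though the permuted weights generally differ entry by entry from the originals. This is the symmetry that makes direct weight averaging of independently trained policies unreliable.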

Figures (5)

  • Figure 1: We consider the problem of merging multiple policies trained on potentially distinct and diverse tasks, which can be more computation- and communication-efficient than pooling all data together for joint training. Instead of acquiring an astronomical amount of data top-down (red arrow, requiring terabytes per day of data transfer), we demonstrate that the bottom-up approach (green arrow, megabytes per day), merging locally trained policies, can also produce general policies that incorporate the skills learned by the individual constituent policies. Moreover, local training and weight sharing are better suited to agents that actively generate data, as is especially the case in robotic control, and make communication with other agents more efficient. We aim to achieve the following objective in fleet learning: one robot learns, the entire fleet learns.
  • Figure 2: Drake Tool-Use Benchmark. We develop several tool-use tasks that focus on contact-rich motions and compositions, including using a spatula, knife, hammer, and wrench in the Drake simulator.
  • Figure 3: (Top) Mode Connectivity Setting. Recall the performance barrier: the x-axis denotes the interpolation ratio $\lambda$ between the two models, and the y-axis shows the success rate of the policy rollouts. (Left) Skill Merging Setting. The x-axis denotes the non-IIDness (increasing to the right). The performance is upper-bounded by joint training and lower-bounded by single-shot merging. (Right) Decentralized Policy Learning Setting. The x-axis denotes the number of training epochs between successive merges of the model (decreasing to the right).
  • Figure 4: (Left) Mode Connectivity for Policies on Meta-World. We observe that the connectivity curve is nearly flat. (Right) Different Algorithms Applied to Merging Policies from Multiple Tasks.
  • Figure 5: We show that FedAvg combined with our merging algorithm achieves better performance on Meta-World in decentralized settings when varying the communication epochs (10 tasks) and the partial participation ratios (25 tasks). The performance is upper-bounded by the joint policy success rates and lower-bounded by the merged policy that averages only once.

Theorems & Definitions (1)

  • Proposition 4.1