Online Foundation Model Selection in Robotics

Po-han Li, Oyku Selin Toprak, Aditya Narayanan, Ufuk Topcu, Sandeep Chinchali

TL;DR

The paper addresses the problem of choosing between open-source local and closed-source remote foundation models for robotics tasks in an online, data-efficient manner. It introduces a pipeline in which a fixed open-source encoder produces contextual features and a contextual online learner (PPO) decides which model to invoke, optimizing a composite reward that balances accuracy, latency, and cost. The main contributions are a formulation of online contextual model selection, a practical solution that combines an encoder with a contextual learner, and theoretical insights backed by empirical validation on MMLU and multiple language-based robotic tasks, with up to 14% improvement in task success rate over non-contextual baselines. The approach offers a data-efficient, adaptable framework for real-world robotics applications where new models are released frequently and resource constraints are common.

Abstract

Foundation models have recently expanded into robotics after excelling in computer vision and natural language processing. These models are available in two forms: open-source, or paid and closed-source. Users with access to both must decide between effective yet costly closed-source models and free but less powerful open-source alternatives. We call this the model selection problem. Existing supervised-learning methods are impractical because collecting extensive training data from closed-source models is expensive. Hence, we focus on the online learning setting, where algorithms learn while collecting data, eliminating the need for large pre-collected datasets. We formulate a user-centric online model selection problem and propose a novel solution that combines an open-source encoder, which outputs the context, with an online learning algorithm that processes this context. The encoder distills vast data distributions into low-dimensional features, i.e., the context, without additional training. The online learning algorithm uses the context extracted from the data to maximize a composite reward that accounts for model performance, execution time, and costs. This yields a better trade-off between selecting open-source and closed-source models than non-contextual methods achieve, as supported by our theoretical analysis. Experiments across language-based robotic tasks such as Waymo Open Dataset, ALFRED, and Open X-Embodiment demonstrate real-world applications of the solution. The results show that the solution improves the task success rate by up to 14%.
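
The selection loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the TL;DR names PPO as the contextual learner, whereas this sketch swaps in a simpler epsilon-greedy linear bandit to stay short. The one-hot contexts stand in for encoder features, and the reward weights, latencies, and costs are hypothetical values chosen only for illustration.

```python
import random


def composite_reward(correct, latency_s, cost_usd, w_latency=0.1, w_cost=1.0):
    """Accuracy minus weighted latency and cost (the weights are assumptions)."""
    return float(correct) - w_latency * latency_s - w_cost * cost_usd


class EpsilonGreedyLinearSelector:
    """Keeps one linear reward predictor per model and updates it online."""

    def __init__(self, n_models, dim, epsilon=0.1, lr=0.05, seed=0):
        self.weights = [[0.0] * dim for _ in range(n_models)]
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def predict(self, model, context):
        return sum(w * x for w, x in zip(self.weights[model], context))

    def select(self, context):
        # Explore with probability epsilon, otherwise pick the best predicted model.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        scores = [self.predict(m, context) for m in range(len(self.weights))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, model, context, reward):
        # One SGD step on the squared error of the chosen model's predictor.
        error = reward - self.predict(model, context)
        self.weights[model] = [
            w + self.lr * error * x for w, x in zip(self.weights[model], context)
        ]


def simulate(steps=2000, seed=0):
    """Toy environment: model 0 is local (fast, free), model 1 is remote (paid)."""
    rng = random.Random(seed)
    selector = EpsilonGreedyLinearSelector(n_models=2, dim=2, seed=seed)
    for _ in range(steps):
        cluster = rng.randrange(2)  # hypothetical task cluster: 0 = easy, 1 = hard
        context = [1.0, 0.0] if cluster == 0 else [0.0, 1.0]
        model = selector.select(context)
        if model == 0:
            # The local model only succeeds on the easy cluster.
            reward = composite_reward(cluster == 0, latency_s=0.1, cost_usd=0.0)
        else:
            # The remote model always succeeds but is slower and costs money.
            reward = composite_reward(True, latency_s=1.0, cost_usd=0.2)
        selector.update(model, context, reward)
    return selector


if __name__ == "__main__":
    learned = simulate()
    learned.epsilon = 0.0  # act greedily after learning
    print("easy task ->", learned.select([1.0, 0.0]))
    print("hard task ->", learned.select([0.0, 1.0]))
```

In this toy setting a non-contextual learner would have to commit to a single model for every input, while the contextual learner can route easy inputs to the free local model and hard inputs to the paid remote one.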

Paper Structure (13 sections, 16 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Examples of language-based robotic tasks: Robots receive language instructions from users and visual observations from the environment and output either actions to pilot the robots (ALFRED and Open X-Embodiment) or answers to the users (Waymo).
  • Figure 2: Online model selection pipeline: A user sends their intentions in natural language and images to a model selected from a range of available options. To do so, an encoder first processes the language and visual inputs to extract features. These features help an online learning algorithm select the suitable model that maximizes accuracy and minimizes response latency and monetary costs. The algorithm should avoid selecting models that execute incorrectly, marked with red crosses. The above examples come from the ALFRED dataset.
  • Figure 3: Context naturally forms clusters: The t-SNE visualization projects high-dimensional CLIP-extracted features, i.e., the context, into two dimensions, revealing feature clusters that correspond to task categories even though no such labels are used.
  • Figure 4: Models perform differently on various tasks: Each model performs differently across tasks, which are shown in different colors. The models on the left are the local models, which have the lowest overall performance. This variation highlights the need to select the most effective model for each task, or even for each data point. The error bars represent the $95\%$ confidence intervals.
  • Figure 5: Models perform differently on various tasks with latency and costs: Each model performs differently across tasks, which are shown in different colors. The models on the left are the local models. Once latency and costs are factored in, the local model performs no worse, as it is faster and free to execute.
  • ...and 3 more figures