Task-conditioned adaptation of visual features in multi-task policy learning

Pierre Marza, Laetitia Matignon, Olivier Simonin, Christian Wolf

TL;DR

This work adapts perception to the task at hand in multi-task robotic policies by introducing task-conditioned adapters that modulate a frozen, pre-trained Vision Transformer. A single multi-task behavior cloning policy uses a learnable task embedding to adapt visual features to each downstream task: for known tasks the embedding is learned from ground-truth task identifiers, and for unseen tasks it is estimated by optimization over a few demonstrations. Empirical results on CortexBench show that adapters improve performance over non-adapted baselines and that conditioning them on the task embedding yields further gains, while few-shot adaptation generalizes to new tasks without weight updates. The approach remains effective across multiple visual backbones and embodiments, suggesting that task-aware perception can be realized with parameter-efficient adapters in a single generalist policy.

Abstract

Successfully addressing a wide variety of tasks is a core ability of autonomous agents, which requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, also adapting the perception modules. An analogy is the human visual system, which uses top-down signals to focus attention according to the current task. Similarly, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the visual adapters on task embeddings, which can be selected at inference if the task is known, or alternatively inferred from a set of example demonstrations. To this end, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that, in contrast to existing work, they can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given a few demonstrations.
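
To make the idea of a task-conditioned adapter concrete, here is a minimal sketch in plain Python. It assumes a residual bottleneck adapter that receives a frozen feature concatenated with a task embedding; the class name `TaskConditionedAdapter`, the layer sizes, the tanh nonlinearity, and the concatenation scheme are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_linear(n_out, n_in, rng, scale=0.1):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class TaskConditionedAdapter:
    """Hypothetical adapter: residual bottleneck MLP fed with [feature ; task embedding]."""

    def __init__(self, feat_dim, task_dim, bottleneck, rng):
        self.down = init_linear(bottleneck, feat_dim + task_dim, rng)
        self.up = init_linear(feat_dim, bottleneck, rng)

    def __call__(self, feature, task_embedding):
        # Concatenate the frozen feature with the task embedding.
        hidden = linear(*self.down, feature + task_embedding)
        hidden = [math.tanh(h) for h in hidden]
        delta = linear(*self.up, hidden)
        # Residual connection: the frozen feature is kept, the adapter adds a correction.
        return [f + d for f, d in zip(feature, delta)]


rng = random.Random(0)
adapter = TaskConditionedAdapter(feat_dim=4, task_dim=2, bottleneck=3, rng=rng)

frozen_feature = [0.5, -1.0, 0.25, 2.0]  # stands in for one frozen backbone feature
task_a = [1.0, 0.0]                      # assumed embeddings of two different tasks
task_b = [0.0, 1.0]

adapted_a = adapter(frozen_feature, task_a)
adapted_b = adapter(frozen_feature, task_b)
```

The same frozen feature yields a different adapted feature for each task embedding, and only the small adapter would need training; the backbone weights stay untouched.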

Paper Structure

This paper contains 18 sections, 8 equations, 12 figures, and 6 tables.

Figures (12)

  • Figure 1: Task-conditioned adaptation: A single policy can be trained to address multiple heterogeneous tasks, including manipulation and legged motion, and few-shot learning makes it possible to address tasks that are given as demonstrations but were unseen during training. A key element is the task-conditioned adaptation of visual features.
  • Figure 2: Considered tasks: We train the method on a set $T^k$ of known tasks and evaluate it either on the same set, with the task known (Known task setting), or in a Few-shot setting, where a new unseen task from a set $T^u$ is inferred from a few demonstrations.
  • Figure 3: Method overview: (a) the adapted policy is trained with behavior cloning from expert demonstrations and given a visual encoder pre-trained with MAE. The model is conditioned on a task embedding learned from ground-truth 1-in-K task identifiers. (b) In the Few-shot case, a task embedding is estimated by optimization, maximizing the likelihood of given demonstrations of an unknown task. (c) Inference uses a task embedding predicted in the Known task case, or optimized in the Few-shot case.
  • Figure 4: Known task --- Qualitative results: Three successful policy rollouts on known tasks from the test set. The multi-task approach performs well on a variety of diverse tasks while being trained on a limited set of demonstrations.
  • Figure 5: Known task --- Per-task performance of policies in Table \ref{table:visual_adapters}: single-task policies (row (a)), our approach without any adapter (row (b)) and with conditioned middle and top adapters (row (f)). The adapters lead to a performance gain on most tasks, and our multi-task solution is competitive with single-task policies. Colored bars and error bars respectively show mean and std over $3$ training runs (seeds).
  • ...and 7 more figures
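
The few-shot step described in the Figure 3 caption, estimating a task embedding by optimization while all weights stay frozen, can be illustrated with a toy example. The sketch below assumes a linear policy with a scalar action and a fixed-variance Gaussian likelihood, so that maximizing the likelihood of the demonstrations amounts to minimizing the mean squared error. The policy form, the function names, and all numbers are hypothetical and do not reproduce the paper's estimator.

```python
def predict(w_obs, w_task, obs, emb):
    """Frozen toy policy: action = w_obs . obs + w_task . emb (scalar action)."""
    return (sum(w * o for w, o in zip(w_obs, obs))
            + sum(w * e for w, e in zip(w_task, emb)))


def mse(w_obs, w_task, demos, emb):
    """Mean squared error over demonstrations (Gaussian NLL up to constants)."""
    return sum((predict(w_obs, w_task, o, emb) - a) ** 2 for o, a in demos) / len(demos)


def estimate_task_embedding(w_obs, w_task, demos, dim, lr=0.1, steps=500):
    """Gradient descent on the embedding only; policy weights are never updated."""
    emb = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for obs, action in demos:
            err = predict(w_obs, w_task, obs, emb) - action
            for i in range(dim):
                # d/d emb[i] of err**2 is 2 * err * w_task[i]
                grad[i] += 2.0 * err * w_task[i] / len(demos)
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


# Frozen weights of the toy policy (assumed values).
w_obs = [1.0, -0.5]
w_task = [0.8, 0.3]

# Demonstrations of an "unseen" task, generated with a hidden embedding.
hidden_emb = [1.0, -2.0]
observations = [[0.0, 1.0], [1.0, 0.0], [2.0, -1.0], [-1.0, 0.5]]
demos = [(o, predict(w_obs, w_task, o, hidden_emb)) for o in observations]

loss_before = mse(w_obs, w_task, demos, [0.0, 0.0])
estimated_emb = estimate_task_embedding(w_obs, w_task, demos, dim=2)
loss_after = mse(w_obs, w_task, demos, estimated_emb)
```

In this toy setup the embedding is not uniquely identifiable, since only its projection onto `w_task` affects the action. The optimization therefore recovers an embedding that reproduces the demonstrated actions rather than the hidden one, which is all the frozen policy needs at inference.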