Task-conditioned adaptation of visual features in multi-task policy learning
Pierre Marza, Laetitia Matignon, Olivier Simonin, Christian Wolf
TL;DR
This work tackles the challenge of flexibly adapting perception for multi-task robotic policies by introducing task-conditioned adapters that modulate a frozen, pre-trained Vision Transformer. A single multi-task behavior cloning policy uses a learnable task embedding to adapt visual features to each downstream task: for known tasks the learned embedding is selected directly, while for unseen tasks it is estimated by optimization over a few demonstrations. Empirical results on CortexBench show that adapters improve performance over non-adapted baselines and that conditioning them on the task embedding yields further gains, while few-shot adaptation generalizes to new tasks without weight updates. The approach remains effective across multiple visual backbones and embodiments, suggesting that task-aware perception can be realized with parameter-efficient adapters in scalable, generalist policies.
Abstract
Successfully addressing a wide variety of tasks is a core ability of autonomous agents, which requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, also adapting the perception modules. An analogy is the human visual system, which uses top-down signals to direct attention according to the current task. Similarly, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the visual adapters on task embeddings, which can be selected at inference if the task is known, or alternatively inferred from a set of example demonstrations. To this end, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that, in contrast to existing work, these tasks can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given a few demonstrations.
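To make the idea of inferring a task embedding from demonstrations concrete, the following is a minimal toy sketch and not the implementation used in this work. It assumes a frozen linear policy head `P`, a frozen adapter `W_s` that rescales features as a function of the embedding, and synthetic demonstration data; all names, shapes, and the step size are hypothetical. Only the embedding is optimized, by gradient descent on a behavior cloning loss.

```python
import numpy as np

def adapted_actions(x, e, W_s, P):
    """Frozen toy policy: scale features by (1 + W_s e), then apply a linear head P."""
    scale = 1.0 + W_s @ e            # (d,) task-dependent feature scaling
    return (x * scale) @ P           # (N, m) predicted actions

def bc_loss(x, a_star, e, W_s, P):
    """Behavior cloning loss: mean squared action error over demonstration steps."""
    r = adapted_actions(x, e, W_s, P) - a_star
    return float(np.mean(np.sum(r ** 2, axis=1)))

def estimate_task_embedding(x, a_star, W_s, P, steps=3000, lr=0.1):
    """Gradient descent on the embedding only; W_s and P are never updated."""
    e = np.zeros(W_s.shape[1])
    for _ in range(steps):
        r = adapted_actions(x, e, W_s, P) - a_star           # (N, m) residuals
        grad_scale = 2.0 * np.mean(x * (r @ P.T), axis=0)    # dL/dscale, shape (d,)
        e -= lr * (W_s.T @ grad_scale)                       # chain rule to dL/de
    return e

# Synthetic stand-ins for frozen weights and a few demonstrations (all hypothetical).
rng = np.random.default_rng(0)
d, m, k, N = 8, 2, 3, 64
W_s = 0.1 * rng.standard_normal((d, k))   # frozen adapter weights (assumed)
P = rng.standard_normal((d, m))           # frozen policy head (assumed)
x = rng.standard_normal((N, d))           # visual features of demonstration frames
e_true = rng.standard_normal(k)           # embedding of the "unseen" task
a_star = adapted_actions(x, e_true, W_s, P)  # demonstrated actions

e_hat = estimate_task_embedding(x, a_star, W_s, P)
loss_before = bc_loss(x, a_star, np.zeros(k), W_s, P)
loss_after = bc_loss(x, a_star, e_hat, W_s, P)
```

In this toy setting the loss is quadratic in the embedding, so plain gradient descent suffices; with a Vision Transformer backbone and nonlinear adapters, the same loop would presumably rely on automatic differentiation rather than a hand-derived gradient.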
