Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners

Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, Pieter Abbeel

TL;DR

This work tackles the challenge of scaling online value-based reinforcement learning to many tasks by introducing Bigger, Regularized, Categorical (BRC). BRC combines a scaled, residual Q-value model (BroNet), a cross-entropy loss obtained through distributional RL, and task embeddings learned online through the TD loss, together with per-task reward normalization. Across 283 tasks in five benchmarks, BRC achieves state-of-the-art performance in both single-task and multi-task settings and transfers to new tasks with substantial sample efficiency, even at 1B parameter scale. The results show that online multi-task TD learning can be computationally efficient and that pretrained multi-task value models transfer effectively, challenging the notion that online scaling requires offline data or behavioral cloning. The approach presents a practical foundation for generalist value models in RL and opens avenues for further understanding task interactions and transfer potential.
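The two loss-side ingredients named above, a cross-entropy objective over a categorical value representation and per-task reward normalization, can be illustrated with a small sketch. This is not the authors' implementation. The two-hot target encoding is a swapped-in simplification (the summary only says "cross-entropy loss via distributional RL"), the running-maximum return scale is an assumed normalization rule, and the names `two_hot`, `cross_entropy`, and `PerTaskRewardNormalizer` are hypothetical.

```python
import math


def two_hot(value, v_min, v_max, num_bins):
    """Encode a scalar as a distribution over evenly spaced bins.

    Assumption: two-hot encoding stands in for the distributional
    target; the source does not specify the projection used.
    """
    value = min(max(value, v_min), v_max)  # clip into the supported range
    step = (v_max - v_min) / (num_bins - 1)
    pos = (value - v_min) / step           # fractional bin index
    lo = int(math.floor(pos))
    hi = min(lo + 1, num_bins - 1)
    w_hi = pos - lo                        # weight on the upper bin
    probs = [0.0] * num_bins
    probs[lo] += 1.0 - w_hi
    probs[hi] += w_hi
    return probs


def cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(p * (l - log_z) for p, l in zip(target_probs, logits))


class PerTaskRewardNormalizer:
    """Divide rewards by a per-task return scale (hypothetical rule).

    Assumption: the scale is the largest absolute episode return seen
    so far for that task; rewards pass through unchanged until then.
    """

    def __init__(self, num_tasks):
        self.scale = [None] * num_tasks

    def update(self, task_id, episode_return):
        mag = abs(episode_return)
        if mag > 0.0:
            prev = self.scale[task_id] or 0.0
            self.scale[task_id] = max(prev, mag)

    def normalize(self, task_id, reward):
        scale = self.scale[task_id]
        return reward if scale is None else reward / scale
```

The point of the sketch is that, once rewards are rescaled per task, value targets from every task fall inside one fixed range `[v_min, v_max]`, so a single set of bins and a single cross-entropy loss can serve all tasks regardless of their original reward magnitudes.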

Abstract

Recent advances in language modeling and vision stem from training large models on diverse, multi-task data. This paradigm has had limited impact in value-based reinforcement learning (RL), where improvements are often driven by small models trained in a single-task context. This is because, in multi-task RL, sparse rewards and gradient conflicts make temporal difference optimization brittle. Practical workflows for generalist policies therefore avoid online training, instead cloning expert trajectories or distilling collections of single-task policies into one agent. In this work, we show that the use of high-capacity value models trained via cross-entropy and conditioned on learnable task embeddings addresses the problem of task interference in online RL, allowing for robust and scalable multi-task training. We test our approach on 7 multi-task benchmarks with over 280 unique tasks, spanning high degree-of-freedom humanoid control and discrete vision-based RL. We find that, despite its simplicity, the proposed approach leads to state-of-the-art single-task and multi-task performance, as well as sample-efficient transfer to new tasks.
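Conditioning a shared value model on learnable task embeddings can be pictured as a lookup table whose rows are trained alongside the critic. The sketch below is illustrative only. Concatenating the embedding to the state and action, the Gaussian initialization, and the names `TaskEmbeddings` and `critic_input` are all assumptions, since the abstract does not describe how the embedding enters the network.

```python
import random


class TaskEmbeddings:
    """One trainable vector per task (hypothetical helper)."""

    def __init__(self, num_tasks, dim, seed=0):
        rng = random.Random(seed)
        # Assumption: small Gaussian initialization.
        self.table = [
            [rng.gauss(0.0, 0.1) for _ in range(dim)]
            for _ in range(num_tasks)
        ]

    def lookup(self, task_id):
        return self.table[task_id]

    def apply_gradient(self, task_id, grad, lr):
        """SGD step on the row of the task the gradient came from."""
        row = self.table[task_id]
        self.table[task_id] = [w - lr * g for w, g in zip(row, grad)]


def critic_input(state, action, task_id, embeddings):
    """Build the shared critic's input for one transition.

    Assumption: the embedding is concatenated to state and action.
    """
    return list(state) + list(action) + list(embeddings.lookup(task_id))


# Usage: the same state-action pair yields different critic inputs
# under different task ids, so one network can represent many tasks.
emb = TaskEmbeddings(num_tasks=3, dim=4)
x_task0 = critic_input([0.1, 0.2, 0.3], [1.0, -1.0], 0, emb)
x_task1 = critic_input([0.1, 0.2, 0.3], [1.0, -1.0], 1, emb)
```

In this design all network weights are shared and only the small embedding row is task-specific, which contrasts with a separate-heads design where each task owns its own output layer.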

Paper Structure

This paper contains 22 sections, 7 equations, 24 figures, 5 tables.

Figures (24)

  • Figure 1: Scaling multi-task training leads to state-of-the-art performance. Naïve scaling of SAC to multi-task decreases the aggregate performance (left). Our proposed method (BRC) works both in single and multi-task learning and provides a pronounced performance improvement over previous approaches, including optimized single-task learners (right). We denote multi-task agents with $\star$.
  • Figure 2: Scaling multi-task training allows for sample-efficient transfer to new tasks. We compare the performance of a single-task BRC agent trained from scratch (green) to that of an agent initialized from our multi-task BRC agent pretrained on different tasks (blue). We find that transferring a multi-task BRC model to new tasks leads to better sample efficiency than learning from scratch. The Y-axis denotes the average final performance.
  • Figure 3: Cross-entropy loss stabilizes online multi-task learning. We investigate BRC with naive application of MSE loss (purple), MSE loss paired with return normalization (green) and cross-entropy paired with return normalization (blue) on HB-Medium. Varying reward magnitudes in multi-task learning can destabilize learning of certain tasks, which translates to high variance of signals between tasks (left). Stabilizing this effect via cross-entropy loss allows for improved scaling when moving from single to multi-task learning (right).
  • Figure 4: BroNet paired with cross-entropy loss scales in both single and multi-task RL. We compare the scaling behavior of different architectures in single-task (left) and multi-task (right) settings when solving the HB-Medium benchmark. We pair SAC with the vanilla [haarnoja2018soft], SimBa [lee2024simba], BroNet with mean squared error loss [nauman2024bigger], and the proposed BroNet with cross-entropy loss architectures. Both figures report final performance after 1M steps.
  • Figure 5: Using task embeddings is preferable to a separate heads design. We compare the performance (left) and gradient similarity [yu2020gradient] (right) of different approaches for multi-task learning on HB-Medium. We consider three variants of our proposed BRC: single-task, multi-task with separate heads [hessel2019multi, kumar2023offline], and multi-task with task embeddings. We find that the task embeddings design outperforms the other variants at all considered model scales and, interestingly, that the separate heads design performs better than the single-task oracle only past a certain model scale.
  • ...and 19 more figures