Beyond Shared Hierarchies: Deep Multitask Learning through Soft Layer Ordering

Elliot Meyerson, Risto Miikkulainen

TL;DR

The paper challenges the standard parallel ordering assumption in deep multitask learning and demonstrates that allowing flexible, task-specific layer usage through permuted and soft ordering can significantly improve cross-task sharing. It introduces a soft ordering mechanism that jointly learns how to apply shared layers at different depths for different tasks, outperforming fixed-order MTL and single-task baselines across MNIST, UCI, Omniglot, and CelebA. The results reveal that shared layers can serve as generalizable primitives assembled in task-dependent ways, suggesting a path toward scalable, modular building blocks for unseen tasks. Overall, soft ordering not only boosts performance but also provides insights into the functional roles of learned layers across diverse tasks.

Abstract

Existing deep multitask learning (MTL) approaches align layers shared between tasks in a parallel ordering. Such an organization significantly constricts the types of shared structure that can be learned. The necessity of parallel ordering for deep MTL is first tested by comparing it with permuted ordering of shared layers. The results indicate that a flexible ordering can enable more effective sharing, thus motivating the development of a soft ordering approach, which learns how shared layers are applied in different ways for different tasks. Deep MTL with soft ordering outperforms parallel ordering methods across a series of domains. These results suggest that the power of deep MTL comes from learning highly general building blocks that can be assembled to meet the demands of each task.

Paper Structure

This paper contains 22 sections, 7 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Classes of existing deep multitask learning architectures. (a) Classical approaches add a task-specific decoder to the output of the core single-task model for each task; (b) Column-based approaches include a network column for each task, and define a mechanism for sharing between columns; (c) Supervision at custom depths adds output decoders at depths based on a task hierarchy; (d) Universal representations adapts each layer with a small number of task-specific scaling parameters. Underlying each of these approaches is the assumption of parallel ordering of shared layers (Section \ref{subsec:fhs}): each one requires aligned sequences of feature extractors across tasks.
  • Figure 2: Fitting two random tasks. (a) The dotted lines show that permuted ordering fits $n$ samples as well as parallel fits $n/2$ for linear networks; (b) For ReLU networks, permuted ordering enjoys a similar advantage. Thus, permuted ordering of shared layers eases integration of information across disparate tasks.
  • Figure 3: Soft ordering of shared layers. Sample soft ordering network with three shared layers. Soft ordering (Eq. \ref{eq:softordering}) generalizes Eqs. \ref{eq:hardsharing} and \ref{eq:permutedsharing}, by learning a tensor $S$ of task-specific scaling parameters. $S$ is learned jointly with the $F_j$, to allow flexible sharing across tasks and depths. The $F_j$ in this figure each include a shared weight layer and any nonlinearity. This architecture enables the learning of layers that are used in different ways at different depths for different tasks.
  • Figure 4: MNIST results. (a) Relative performance of permuted and soft ordering compared to parallel ordering improves as the number of tasks increases, showing how flexibility of order can help in scaling to more tasks. Note that cost savings of multitask over single task models in terms of number of trainable parameters scales linearly with the number of tasks. For a representative two-task soft order experiment (b) the layer-wise distance between scalings of the tasks increases by iteration, and (c) the scalings move towards a hard ordering. (d) The final learned relative scale of each shared layer at each depth for each task is indicated by shading, with the strongest path drawn, showing that a distinct soft order is learned for each task ($\bullet$ marks the shared model boundary).
  • Figure 5: UCI data sets and results. (a) The ten UCI tasks used in joint training; the varying types of problems and dataset characteristics show the diversity of this set of tasks. (b) Mean test error over all ten tasks by iteration. Permuted and parallel order show no improvement after the first 1000 iterations, while soft order decisively outperforms the other methods.
  • ...and 2 more figures
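
The Figure 3 caption describes soft ordering as learning a tensor $S$ of task-specific scaling parameters that generalizes both parallel and permuted ordering. The following is a minimal sketch of that forward pass for a single task, not the authors' implementation. It assumes that the scales at each depth are softmax-normalized over the shared layers, that every shared layer $F_j$ is a square weight matrix followed by ReLU, and that task-specific encoders and decoders are omitted. The names `soft_order_forward`, `hard_order_forward`, and `hard_logits` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def shared_layer(weights, x):
    """One shared layer F_j: square weight matrix followed by ReLU (assumed form)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def soft_order_forward(x, layers, task_logits):
    """Shared-core forward pass for one task under soft ordering.

    layers      : list of D weight matrices, shared by all tasks
    task_logits : D x D table for this task; task_logits[k][j] is the
                  unnormalized scale of layer j at depth k (assumption)
    """
    y = x
    for depth_logits in task_logits:
        scales = softmax(depth_logits)            # scales sum to 1 at each depth
        outputs = [shared_layer(w, y) for w in layers]
        # Weighted sum of every shared layer's output at this depth.
        y = [sum(s * out[i] for s, out in zip(scales, outputs))
             for i in range(len(y))]
    return y


def hard_order_forward(x, layers, order):
    """Reference: apply the shared layers in one fixed order."""
    y = x
    for j in order:
        y = shared_layer(layers[j], y)
    return y


def hard_logits(order, strength=1000.0):
    """Logits whose softmax is (numerically) one-hot on order[k] at depth k."""
    depth = len(order)
    return [[strength if j == order[k] else 0.0 for j in range(depth)]
            for k in range(depth)]


if __name__ == "__main__":
    random.seed(0)
    D, width = 3, 4
    layers = [[[random.uniform(-1, 1) for _ in range(width)]
               for _ in range(width)] for _ in range(D)]
    x = [random.uniform(-1, 1) for _ in range(width)]

    # One-hot scales on the diagonal recover parallel ordering.
    print(soft_order_forward(x, layers, hard_logits([0, 1, 2])))
    # One-hot scales on a permutation recover permuted ordering.
    print(soft_order_forward(x, layers, hard_logits([2, 0, 1])))
    # Equal logits mix all shared layers uniformly at every depth.
    print(soft_order_forward(x, layers, [[0.0] * D for _ in range(D)]))
```

In this sketch, each task would hold its own `task_logits` table while the `layers` list is shared, so parallel and permuted ordering appear as the special cases where every row of scales is one-hot. Training the logits jointly with the layer weights is outside the scope of the example.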