Which Tasks Should Be Learned Together in Multi-task Learning?

Trevor Standley, Amir R. Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, Silvio Savarese

TL;DR

This work asks which tasks should be learned together in multi-task learning under a fixed inference-time budget. It introduces a task-grouping framework that evaluates all non-empty task subsets and selects a small set of networks, each solving a subset of tasks, to minimize total loss within the budget. The authors show that task relationships depend strongly on the learning setting and that naive joint training can underperform carefully grouped task networks. Two approximations, the Early Stopping Approximation (ESA) and the Higher Order Approximation (HOA), reduce the training cost of finding near-optimal groupings. Across multiple settings, the approach outperforms both single-task and all-in-one multi-task baselines, which highlights the value of automatic task grouping for real-time multi-task vision systems.

Abstract

Many computer vision applications require solving multiple tasks in real-time. A neural network can be trained to solve multiple tasks simultaneously using multi-task learning. This can save computation at inference time as only a single network needs to be evaluated. Unfortunately, this often leads to inferior overall performance as task objectives can compete, which consequently poses the question: which tasks should and should not be learned together in one network when employing multi-task learning? We study task cooperation and competition in several different learning settings and propose a framework for assigning tasks to a few neural networks such that cooperating tasks are computed by the same neural network, while competing tasks are computed by different networks. Our framework offers a time-accuracy trade-off and can produce better accuracy using less inference time than not only a single large multi-task neural network but also many single-task networks.
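The assignment problem described in the abstract can be illustrated with a small brute-force search. This is only a sketch under stated assumptions, not the authors' implementation: the candidate networks, their costs, and their per-task losses below are hypothetical placeholders, and the names `CANDIDATES` and `best_grouping` are invented for illustration. It assumes each candidate network has already been trained and evaluated, that every task is served by whichever selected network predicts it best, and that total inference cost is the sum of the selected networks' costs.

```python
from itertools import combinations

# Hypothetical candidates: (inference cost, {task: loss when predicted by this network}).
# Task letters follow the figure captions (s: segmentation, d: depth, n: normals).
# All numbers are illustrative placeholders, not results from the paper.
CANDIDATES = [
    (1.0, {"s": 0.10, "d": 0.17}),             # full-size network trained on s and d
    (0.5, {"n": 0.15}),                        # half-size network trained on n alone
    (1.0, {"s": 0.12, "d": 0.18, "n": 0.20}),  # full-size network trained on all three
    (0.5, {"s": 0.11}),                        # half-size network trained on s alone
    (0.5, {"d": 0.25}),                        # half-size network trained on d alone
]


def best_grouping(candidates, tasks, budget):
    """Return (total_loss, chosen_indices) for the best feasible set of networks.

    A set is feasible if its summed cost fits the budget and every task is
    predicted by at least one chosen network. Each task is scored with the
    lowest loss among the chosen networks that predict it, so a network may
    carry an output that ends up unused.
    """
    best = (float("inf"), ())
    for size in range(1, len(candidates) + 1):
        for chosen in combinations(range(len(candidates)), size):
            cost = sum(candidates[i][0] for i in chosen)
            if cost > budget:
                continue
            total = 0.0
            feasible = True
            for task in tasks:
                losses = [candidates[i][1][task] for i in chosen
                          if task in candidates[i][1]]
                if not losses:
                    feasible = False  # no chosen network predicts this task
                    break
                total += min(losses)
            if feasible and total < best[0]:
                best = (total, chosen)
    return best


if __name__ == "__main__":
    for budget in (1.0, 1.5):
        print(budget, best_grouping(CANDIDATES, "sdn", budget))
```

With these placeholder numbers, a budget of 1.0 only admits the single three-task network, while a budget of 1.5 lets the search split the tasks across a two-task network and a dedicated half-size network. Exhaustive search grows exponentially in the number of candidates, so it is only practical for small task sets like this one.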

Paper Structure

This paper contains 12 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Given five example tasks to solve, there are many ways that they can be split into task groups for multi-task learning. How do we find the best one? We propose a computational framework that, for instance, suggests the following grouping to achieve the lowest total loss, using a computational budget of 2.5 units: train network A to solve Semantic Segmentation, Depth Estimation, and Surface Normal Prediction; train network B to solve Keypoint Detection, Edge Detection, and Surface Normal Prediction; train network C with a less computationally expensive encoder to solve Surface Normal Prediction alone. Including Surface Normals as an output in the first two networks was found advantageous for improving the other outputs, while the best Normals were predicted by the third network. This task grouping outperforms all other feasible ones, including learning all five tasks in one large network or using five dedicated smaller networks.
  • Figure 2: Performance/inference time trade-off for various methods in Setting 1. We do not report error bars because the test set is large enough that standard errors are too small to be shown.
  • Figure 3: The task groups picked by each of our techniques for integer budgets between 1 and 5. Networks are shown as a large $\bigcirc$ (full-size) or a small $\circ$ (half-size). Networks are connected to the tasks for which they compute predictions. s: Semantic Segmentation, d: Depth Estimation, n: Surface Normal Prediction, k: Keypoint Detection, e: Edge Detection. Dotted edges represent unused decoders. For example, the green highlighted solution consists of two half-size networks and a full-size network. The full-size network solves Depth Estimation, Surface Normal Prediction, and Keypoint Detection. One half-size network solves Semantic Segmentation and the other solves Edge Detection. The total loss for all five tasks is 0.455. The groupings for fractional budgets are shown in the supplemental material.
  • Figure 4: Performance/inference time trade-off in Setting 2.
  • Figure 5: Performance/inference time trade-off in Setting 3.
  • ...and 2 more figures