Multi-Task Learning with Multi-Task Optimization

Lu Bai; Abhishek Gupta; Yew-Soon Ong

TL;DR

This paper recasts Pareto multi-task learning as a multi-task optimization problem and evaluates the resulting method on image classification, scene understanding, and multi-target regression. The experiments confirm that it significantly advances the state of the art in discovering sets of Pareto-optimized models.

Abstract

Multi-task learning solves multiple correlated tasks. However, conflicts may exist between them. In such circumstances, a single solution can rarely optimize all the tasks, leading to performance trade-offs. To arrive at a set of optimized yet well-distributed models that collectively embody different trade-offs in one algorithmic pass, this paper proposes to view Pareto multi-task learning through the lens of multi-task optimization. Multi-task learning is first cast as a multi-objective optimization problem, which is then decomposed into a diverse set of unconstrained scalar-valued subproblems. These subproblems are solved jointly using a novel multi-task gradient descent method, whose uniqueness lies in the iterative transfer of model parameters among the subproblems during the course of optimization. A theorem proving faster convergence through the inclusion of such transfers is presented. We investigate the proposed multi-task learning with multi-task optimization for solving various problem settings including image classification, scene understanding, and multi-target regression. Comprehensive experiments confirm that the proposed method significantly advances the state-of-the-art in discovering sets of Pareto-optimized models. Notably, on the large image dataset we tested on, namely NYUv2, the hypervolume convergence achieved by our method was found to be nearly two times faster than the next-best among the state-of-the-art.
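The pipeline described in the abstract (decompose the multi-objective problem into scalar subproblems, then solve them jointly while transferring parameters) can be illustrated on a toy problem. The sketch below is not the authors' implementation: the weighted-sum scalarization, the neighbour-averaging transfer rule, the decaying transfer strength `m0 * decay**t`, and all function names are assumptions chosen for illustration. The two toy objectives are `||theta - a||^2` and `||theta - b||^2`.

```python
def grad_scalarized(theta, w, a, b):
    """Gradient of w*||theta - a||^2 + (1 - w)*||theta - b||^2 (assumed scalarization)."""
    return [2 * w * (t - ai) + 2 * (1 - w) * (t - bi)
            for t, ai, bi in zip(theta, a, b)]


def solve_with_transfer(weights, a, b, steps=200, alpha=0.1, m0=0.2, decay=0.9):
    """Jointly run gradient descent on all subproblems with parameter transfer.

    Each subproblem i mixes its own parameters with the average of its
    neighbouring subproblems (hypothetical transfer rule), then takes a
    gradient step on its own scalarized objective. Needs >= 2 subproblems.
    """
    k, dim = len(weights), len(a)
    thetas = [[0.0] * dim for _ in range(k)]
    for t in range(steps):
        m = m0 * decay ** t  # transfer strength, decays towards zero
        updated = []
        for i in range(k):
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < k]
            grad = grad_scalarized(thetas[i], weights[i], a, b)
            row = []
            for d in range(dim):
                nbr_avg = sum(thetas[j][d] for j in nbrs) / len(nbrs)
                mixed = (1 - m) * thetas[i][d] + m * nbr_avg
                row.append(mixed - alpha * grad[d])
            updated.append(row)
        thetas = updated  # synchronous update across subproblems
    return thetas
```

Calling `solve_with_transfer([0.0, 0.25, 0.5, 0.75, 1.0], [1.0, 0.0], [0.0, 1.0])` returns one parameter vector per weight. For this toy problem each vector approaches `w * a + (1 - w) * b`, so the five models spread along the segment between the two single-objective optima, each embodying a different trade-off.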

Paper Structure (22 sections, 1 theorem, 35 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Background
Conflicts in Multi-Task Learning
Multi-Objective Optimization
Pareto Multi-Task Learning
Multi-Task Optimization
Preliminaries
Casting Multi-Task Learning as Multi-Objective Optimization
Recasting Multi-Objective Optimization as Multi-Task Optimization
MT$^2$O and its Theoretical Analysis
Multi-Task Gradient Descent
Faster Convergence by Multi-task Transfer
Summary of the Proposed MT$^2$O
Time Complexity
Experiments
...and 7 more sections

Key Result

Theorem 1

Suppose there exist $i$ and $j$ such that $H_i^t \neq H_j^t$, and the transfer coefficient $M^t_{ij}$ satisfies […], where $T_0$ is a nonnegative integer satisfying […]. Then, under multi-task gradient descent (mgd), $\|\tilde{\bm\theta}^t\|$ converges to zero faster than it does without transfer, provided that $\exists\ T_0>0$ and the step size $\alpha$ satisfies […].

Figures (8)

Figure 1: Finding a set of Pareto MTL models in one algorithmic pass by means of jointly solving related subproblems with multi-task optimization. (a) Turning MTL into a set of subproblems. (b) Each subproblem provides one Pareto optimal model. Different Pareto optimal models embody different trade-offs among the tasks.
Figure 2: The results for the synthetic examples. The top row shows the approximated Pareto front of the first run, and the bottom row shows the HV value convergence curves during optimization, calculated using the reference point (1.1,1.1). The HV values are averaged over 30 runs.
Figure 3: Architecture of the MTL network used for each subproblem for the MultiMNIST, MultiFashionMNIST, Multi-(Fashion+MNIST) datasets.
Figure 4: The results for the three MNIST-like datasets. The top row shows the test accuracies above 0.4, the middle row shows the training losses below 2, and the bottom row shows the HV value convergence curves during the training process, calculated using the reference point (2,2).
Figure 5: Attribute-wise misclassification error percentage on CelebA dataset. Lower values indicate better performance.
...and 3 more figures
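The captions above report hypervolume (HV) values computed against a reference point such as (1.1, 1.1) or (2, 2). As a rough illustration of what that number measures for two minimized objectives, here is a minimal sketch of a 2D hypervolume computation. The function name `hypervolume_2d` and the sample points are hypothetical, and the paper's own HV routine may differ.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Only points strictly better than the reference point contribute.
    inside = [p for p in points if p[0] < ref[0] and p[1] < ref[1]]
    inside.sort()  # ascending in objective 1, ties broken by objective 2

    area = 0.0
    best_f2 = ref[1]  # lowest objective-2 value seen so far in the sweep
    for f1, f2 in inside:
        if f2 < best_f2:
            # New horizontal strip between the previous best f2 and this f2.
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area
```

For the points (0, 1), (0.5, 0.5), and (1, 0) with reference point (1.1, 1.1), the three strips have areas 0.11, 0.30, and 0.05, giving a hypervolume of about 0.46. A larger value indicates a front that is closer to the ideal point, better spread, or both.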

Theorems & Definitions (6)

Definition 1: Pareto Dominance
Definition 2: Pareto Optimality
Definition 3: Pareto Set
Definition 4: Pareto Front
Theorem 1
Proof
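Definitions 1 to 4 are listed here by name only. The sketch below uses the standard textbook formulation for minimization, which is assumed (not confirmed by this page) to match the paper's definitions. The function names are hypothetical.

```python
def dominates(u, v):
    """True if objective vector u Pareto-dominates v (all objectives minimized)."""
    no_worse = all(ui <= vi for ui, vi in zip(u, v))
    strictly_better = any(ui < vi for ui, vi in zip(u, v))
    return no_worse and strictly_better


def pareto_front(objectives):
    """Return the objective vectors that no other vector in the list dominates."""
    return [u for u in objectives
            if not any(dominates(v, u) for v in objectives)]
```

For example, among the loss pairs (1, 3), (2, 2), (3, 1), and (3, 3), the last is dominated by (2, 2), so `pareto_front` keeps only the first three. Those three are mutually non-dominated: improving one task's loss worsens the other's.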
