Multi-Objective Optimization for Sparse Deep Multi-Task Learning

S. S. Hotegni; M. Berkemeier; S. Peitz

TL;DR

The paper tackles conflicting objectives in deep learning by formulating a multi-objective optimization framework that jointly minimizes primary task losses and a sparsity objective. It leverages a Modified Weighted Chebyshev scalarization with an Augmented Lagrangian to efficiently compute a Pareto front of DNN parameterizations, while GrOWL regularization induces adaptive sparsity. The proposed Monitored Deep Multi-Task Network (MDMTN) architecture combines a shared backbone with lightweight task-specific monitors to preserve cross-task information while enabling sparsity-driven pruning during training. Empirical results on MultiMNIST and a new Cifar10Mnist dataset show that MDMTN often achieves superior or comparable main-task performance with substantial parameter reduction and favorable task-sharing properties, illustrating the practical viability of adaptive sparsification in multi-task settings.
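The GrOWL regularizer named above can be illustrated in a few lines. This is a minimal sketch based on the standard definition of GrOWL from the literature (a weighted sum of row norms sorted in decreasing order), not the authors' implementation; the function name `growl_penalty` and the toy inputs are hypothetical.

```python
import math


def growl_penalty(rows, lambdas):
    """Return sum_i lambdas[i] * (i-th largest l2 norm among the rows).

    rows:    weight matrix given as a list of rows (one group per row).
    lambdas: non-negative, non-increasing weights, one per row.
    """
    if len(rows) != len(lambdas):
        raise ValueError("need exactly one lambda per row")
    if any(l < 0 for l in lambdas):
        raise ValueError("lambdas must be non-negative")
    if any(a < b for a, b in zip(lambdas, lambdas[1:])):
        raise ValueError("lambdas must be non-increasing")
    # l2 norm of every row, sorted from largest to smallest
    norms = sorted(
        (math.sqrt(sum(w * w for w in row)) for row in rows), reverse=True
    )
    # the largest lambda is paired with the largest row norm
    return sum(l * n for l, n in zip(lambdas, norms))


# Toy example (hypothetical values): row norms are 5, 0 and 1.
penalty = growl_penalty([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]], [1.0, 0.5, 0.25])
```

In this sketch each row is one group, so a row whose norm is driven to zero corresponds to a group that can be pruned.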

Abstract

Different conflicting optimization criteria arise naturally in various Deep Learning scenarios. These can address different main tasks (i.e., in the setting of Multi-Task Learning), but also main and secondary tasks such as loss minimization versus sparsity. The usual approach is a simple weighting of the criteria, which formally only works in the convex setting. In this paper, we present a Multi-Objective Optimization algorithm using a modified Weighted Chebyshev scalarization for training Deep Neural Networks (DNNs) with respect to several tasks. By employing this scalarization technique, the algorithm can identify all optimal solutions of the original problem while reducing its complexity to a sequence of single-objective problems. The simplified problems are then solved using an Augmented Lagrangian method, enabling the use of popular optimization techniques such as Adam and Stochastic Gradient Descent, while efficaciously handling constraints. Our work aims to address the (economical and also ecological) sustainability issue of DNN models, with a particular focus on Deep Multi-Task models, which are typically designed with a very large number of weights to perform equally well on multiple tasks. Through experiments conducted on two Machine Learning datasets, we demonstrate the possibility of adaptively sparsifying the model during training without significantly impacting its performance, if we are willing to apply task-specific adaptations to the network weights. Code is available at https://github.com/salomonhotegni/MDMTN
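The pipeline described in the abstract (scalarize, then solve each single-objective problem with an Augmented Lagrangian) can be sketched on a toy problem. The sketch below makes several assumptions that are not from the paper: it uses the plain, unmodified weighted Chebyshev scalarization, two hand-picked scalar objectives $f_1(x) = x^2$ and $f_2(x) = (x-2)^2$, and plain gradient descent in place of Adam or SGD. The function name and all hyperparameters are hypothetical.

```python
def weighted_chebyshev_al(k, rho=1.0, outer_steps=30, inner_steps=300, lr=0.05):
    """Minimize max_i k[i] * f_i(x) for f_1(x) = x^2, f_2(x) = (x - 2)^2.

    Epigraph form:  min_{x,t} t  s.t.  g_i(x, t) = k[i] * f_i(x) - t <= 0.
    Augmented Lagrangian with multipliers mu_i >= 0:
        L = t + (1 / (2 * rho)) * sum_i (max(0, mu_i + rho * g_i)^2 - mu_i^2)
    """
    centers = (0.0, 2.0)  # minimizers of the two toy objectives
    x, t = 0.0, 0.0
    mu = [0.0, 0.0]
    for _ in range(outer_steps):
        # inner loop: gradient descent on L with respect to (x, t)
        for _ in range(inner_steps):
            grad_x, grad_t = 0.0, 1.0
            for i in range(2):
                g = k[i] * (x - centers[i]) ** 2 - t
                s = max(0.0, mu[i] + rho * g)  # shifted multiplier
                grad_x += s * 2.0 * k[i] * (x - centers[i])
                grad_t -= s
            x -= lr * grad_x
            t -= lr * grad_t
        # outer step: multiplier update, projected onto mu_i >= 0
        for i in range(2):
            g = k[i] * (x - centers[i]) ** 2 - t
            mu[i] = max(0.0, mu[i] + rho * g)
    return x, t, mu


# Equal weights: by symmetry the two weighted objectives balance at x = 1.
x_eq, t_eq, mu_eq = weighted_chebyshev_al((0.5, 0.5))
# Favoring the first objective moves the solution toward its minimizer at 0.
x_sk, t_sk, mu_sk = weighted_chebyshev_al((0.8, 0.2))
```

Sweeping the weight vector `k` and collecting the resulting pairs $(f_1(x), f_2(x))$ traces out points of the toy Pareto front, one single-objective solve per weight vector.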

Paper Structure (16 sections, 1 theorem, 7 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Related Works
Multi-Objective Optimization (MOO)
Multi-Task Learning (MTL)
Model Sparsification (MS)
Methodology
Multi-Objective Optimization
Weighted Chebyshev Scalarization
Augmented Lagrangian Transformation
Group Ordered Weighted $l_1$ (GrOWL)
Deep Multi-Task Learning Model
Experiments
Sparsification Effects on Model Performance
MultiMNIST dataset
Cifar10Mnist dataset
...and 1 more section

Key Result

Theorem 1

For any importance vector $k$ with $k_i \ge 0$ for all $i \in \{0, \dots, m\}$, a decision variable $x^* \in \Omega$ is an optimal solution of the Modified Weighted Chebyshev scalarization problem if and only if $x^*$ is properly Pareto optimal.
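
The scalarized problem that the theorem refers to is not reproduced in this summary. One form of a modified (augmented) weighted Chebyshev problem found in the literature is sketched below. It assumes nonnegative objectives $f_0, \dots, f_m$, so that the origin can serve as the reference point, and a small augmentation parameter $a > 0$. The symbols $f_i$, $a$ and $t$ are notation assumed here, and the paper's exact formulation may differ.

$$
\min_{x \in \Omega} \; \max_{0 \le i \le m} \; k_i \Big( f_i(x) + a \sum_{j=0}^{m} f_j(x) \Big)
$$

Introducing an auxiliary variable $t$ gives the equivalent constrained form, which is the kind of problem an Augmented Lagrangian method can handle:

$$
\min_{x \in \Omega,\; t \in \mathbb{R}} \; t \quad \text{subject to} \quad k_i \Big( f_i(x) + a \sum_{j=0}^{m} f_j(x) \Big) \le t, \qquad i = 0, \dots, m.
$$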

Figures (8)

Figure 1: Diagram of the Monitored Deep Multi-Task Network (MDMTN) with 2 main tasks.
Table 1: Results on the MultiMNIST and Cifar10Mnist data. Tasks 1 and 2 are the classification of the left and right digits for MultiMNIST, and of the CIFAR-10 and MNIST images for Cifar10Mnist, respectively. For each metric, the best performance per architecture is boxed, while the best overall performance is in bold. Higher SR and CR are preferable, as is PS $> 1$.
Figure 2: Sample images from the Cifar10Mnist dataset
Figure 3: (MultiMNIST): 3D and 2D visualizations of $39$ Pareto optimal points $(k_0 \neq 0)$ obtained with different preference vectors $k$ (considering the loss functions), and clustered by sparsity rate, as specified in the table on the right. We add the Pareto front obtained without taking into account the sparsity objective to the 2D projection $(k_0 = 0,\ \text{in black})$.
Figure 4: (Cifar10Mnist): 2D projection of $35$ Pareto optimal points ($k_0 \neq 0$, in red) and the Pareto front obtained without taking into account the sparsity objective ($k_0 = 0$, in black).
...and 3 more figures

Theorems & Definitions (3)

Definition 1
Definition 2
Theorem 1: kaliszewski2012quantitative, pp. 48-50.
