Learning Representation for Multitask learning through Self Supervised Auxiliary learning

Seokwon Shin, Hyungrok Do, Youngdoo Son

TL;DR

This work proposes Dummy Gradient norm Regularization (DGR), a novel approach that aims to improve the universality of the representations generated by the shared encoder in multi-task learning.

Abstract

Multi-task learning is a popular machine learning approach that enables simultaneous learning of multiple related tasks, improving algorithmic efficiency and effectiveness. In the hard parameter sharing approach, an encoder shared across multiple tasks generates data representations that are passed to task-specific predictors. Therefore, it is crucial to have a shared encoder that provides decent representations for each and every task. However, despite recent advances in multi-task learning, the question of how to improve the quality of representations generated by the shared encoder remains open. To address this gap, we propose a novel approach called Dummy Gradient norm Regularization (DGR) that aims to improve the universality of the representations generated by the shared encoder. Specifically, the method decreases the norm of the gradient of the loss function with respect to dummy task-specific predictors. Through experiments on multiple multi-task learning benchmark datasets, we demonstrate that DGR effectively improves the quality of the shared representations, leading to better multi-task prediction performance. When applied to various classifiers, the shared representations generated by DGR also show superior performance compared to existing multi-task learning methods. Moreover, our approach is computationally efficient owing to its simplicity. This simplicity also allows DGR to be seamlessly integrated with existing multi-task learning algorithms.
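To make the idea in the abstract concrete, the following is a minimal sketch of a regularized objective of this kind, not the authors' implementation. It assumes a linear shared encoder, linear predictors, a mean-squared-error loss, and a weighting coefficient `lam`; all function names, the choice of loss, and the way the penalty is combined with the task losses are illustrative assumptions that the abstract does not specify.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def mse_loss(pred, target):
    """Mean squared error over all entries."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for pr, tr in zip(pred, target)
               for p, t in zip(pr, tr)) / n

def dummy_gradient_norm(z, theta, target):
    """Frobenius norm of d(MSE)/d(theta) for a linear dummy predictor z @ theta.

    For MSE the gradient has the closed form (2 / n) * z^T (z @ theta - target).
    """
    pred = matmul(z, theta)
    n = len(pred) * len(pred[0])
    residual = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, target)]
    grad = [[2.0 * g / n for g in row] for row in matmul(transpose(z), residual)]
    return math.sqrt(sum(g * g for row in grad for g in row))

def dgr_objective(x, targets, w_enc, predictors, dummies, lam):
    """Sum of task losses plus lam times the dummy-gradient-norm penalty.

    Assumption: one dummy predictor per task, and the penalty is simply
    added to the task losses with weight lam.
    """
    z = matmul(x, w_enc)  # shared representation used by every task
    task_loss = sum(mse_loss(matmul(z, p), y) for p, y in zip(predictors, targets))
    penalty = sum(dummy_gradient_norm(z, d, y) for d, y in zip(dummies, targets))
    return task_loss + lam * penalty

# Tiny single-task example with hypothetical values.
x = [[1.0, 0.0], [0.0, 1.0]]
w_enc = [[1.0, 0.0], [0.0, 1.0]]   # identity encoder
targets = [[[2.0], [3.0]]]
predictors = [[[2.0], [3.0]]]      # fits the targets exactly, so task loss is 0
dummies = [[[0.0], [0.0]]]         # untrained dummy predictor
total = dgr_objective(x, targets, w_enc, predictors, dummies, lam=0.1)
```

In this toy setting the task loss is zero, so `total` consists only of the penalty term. In a real training loop the objective would be differentiated with respect to the encoder parameters with an automatic differentiation library, which this sketch leaves out.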

Paper Structure (13 sections, 1 theorem, 6 equations, 3 figures, 4 tables, 1 algorithm)

This paper contains 13 sections, 1 theorem, 6 equations, 3 figures, 4 tables, 1 algorithm.

Key Result

Theorem (Inverse Proportionality)

Given a task-specific predictor $\psi_{\theta_{\mathcal{T}_{k}}^{\Delta}}$, if $\mathcal{L}_{k}$ is convex with respect to $\theta_{\mathcal{T}_{k}}^{\Delta}$, the universality of the shared encoder is inversely proportional to the Frobenius norm of the gradient of the loss function with respect to

Figures (3)

  • Figure 1: A schematic overview of the proposed DGR, which consists of a shared encoder, task-specific predictors, and task-specific dummy predictors. During the forward pass, task-specific predictors produce the actual prediction for each task, while the backward pass minimizes the sum of task-specific losses and encourages the universality of the shared encoder using dummy predictors. The black and red solid lines represent the forward pass during the training and inference phases, respectively, while the blue dashed line represents the direction of backpropagation for training.
  • Figure 2: Comparison result of average $\Delta_{\text{MTL}}$ over three trials on the UTKFace dataset upon integrating each method into SAM, PTA, and Ours. The bars colored in gray indicate the results obtained using each method independently.
  • Figure 3: (a) Average test performance over 3 trials on the UTKFace dataset, where the learned representation using MTL was fixed, and relatively simple classifiers were used to replace the original classifier. (b) The t-SNE visualization results of shared representations generated by the Vanilla (left) and DGR (right). Each row corresponds to a specific task.

Theorems & Definitions (3)

  • Definition: Optimal Task-Specific Predictors
  • Definition: Universality of Shared Encoder
  • Theorem: Inverse Proportionality