Adapting Auxiliary Losses Using Gradient Similarity

Yunshu Du, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Mehrdad Farajtabar, Razvan Pascanu, Balaji Lakshminarayanan

TL;DR

The paper targets the data inefficiency of neural networks. Auxiliary losses can speed up learning on the main task, but a poorly matched auxiliary task can also hinder it. The authors use the cosine similarity between the main-task and auxiliary-task gradients as an adaptive weight that gates auxiliary updates: training is still guaranteed to converge to a critical point of the main loss, and positive transfer is kept when the gradients align. The approach is evaluated on ImageNet class pairs, rotated MNIST, gridworld RL, and Atari games, where it detects and blocks negative transfer and in many cases accelerates learning. This reduces the need to hand-tune auxiliary loss weights in both supervised and reinforcement learning settings.

Abstract

One approach to deal with the statistical inefficiency of neural networks is to rely on auxiliary losses that help to build useful representations. However, it is not always trivial to know if an auxiliary task will be helpful for the main task and when it could start hurting. We propose to use the cosine similarity between gradients of tasks as an adaptive weight to detect when an auxiliary loss is helpful to the main loss. We show that our approach is guaranteed to converge to critical points of the main task and demonstrate the practical usefulness of the proposed algorithm in a few domains: multi-task supervised learning on subsets of ImageNet, reinforcement learning on gridworld, and reinforcement learning on Atari games.

Paper Structure

This paper contains 23 sections, 3 theorems, 17 equations, 12 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

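Given any gradient vector field $G({\bm{\theta}}) = \nabla_{\bm{\theta}} \mathcal{L}({\bm{\theta}})$ and any vector field $V({\bm{\theta}})$ (such as the gradient of another loss function, or an arbitrary set of updates), an update rule of the form

$${\bm{\theta}}^{(t+1)} := {\bm{\theta}}^{(t)} - \alpha^{(t)} \left( G({\bm{\theta}}^{(t)}) + V({\bm{\theta}}^{(t)}) \max\left(0, \cos\left(G({\bm{\theta}}^{(t)}), V({\bm{\theta}}^{(t)})\right)\right) \right)$$

converges to a local minimum of $\mathcal{L}$ given small enough $\alpha^{(t)}$.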
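To make the proposition concrete, here is a minimal NumPy sketch of a cosine-gated update. It assumes the rule takes the form $\theta \leftarrow \theta - \alpha \, (G + \max(0, \cos(G, V)) \, V)$, which matches the abstract's description of cosine similarity as an adaptive weight. The function names (`cosine_similarity`, `gated_step`), the quadratic toy loss, and the opposing vector field are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def cosine_similarity(g, v, eps=1e-12):
    """Cosine of the angle between two flattened vectors (0.0 if either is ~zero)."""
    denom = np.linalg.norm(g) * np.linalg.norm(v)
    if denom < eps:
        return 0.0
    return float(np.dot(g, v) / denom)


def gated_step(theta, main_grad, aux_field, lr):
    """One update: follow the main gradient, add the auxiliary field only if aligned."""
    g = main_grad(theta)
    v = aux_field(theta)
    weight = max(0.0, cosine_similarity(g, v))  # zero when the field opposes g
    return theta - lr * (g + weight * v)


# Hypothetical toy problem: main loss L(theta) = 0.5 * ||theta||^2.
def main_grad(theta):
    return theta


# Hypothetical auxiliary field that always points against the main gradient.
def opposing_field(theta):
    return -2.0 * theta


theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = gated_step(theta, main_grad, opposing_field, lr=0.1)

print(np.linalg.norm(theta))  # shrinks toward 0, the minimum of the main loss
```

In this toy case, adding the two fields without the gate gives the step `theta - 0.1 * (theta - 2 * theta) = 1.1 * theta`, which moves away from the minimum. With the gate, the cosine is -1, the auxiliary term is dropped, and each step reduces to plain gradient descent on the main loss.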

Figures (12)

  • Figure 1: Illustration of cosine similarity between gradients on synthetic loss surfaces.
  • Figure 2: Positive and negative examples of our proposed method. Top row: combine $L_1$ with the gradient of an auxiliary loss $L_3$. Middle row: combine $L_1$ with a vector field $V$. Bottom row: combine $L_2$ with the gradient of an auxiliary loss $L_4$. Our method (the last column) converges in all cases, while simply adding a gradient or vector field leads to divergence (the second column).
  • Figure 3: Experiments on ImageNet class pairs. (a): gradient cosine similarity is higher for near pairs and lower for far pairs. (b) and (c): testing accuracy on single-task (dotted), multi-task (dashed), and our method (solid).
  • Figure 4: Top row: expected learning curves for cross-environment distillation experiments, averaged over $1,000$ partially observable gridworlds. The teacher's policy is based on Q-Learning; its performance in a new environment (with modified positive rewards) is represented by the top dotted line. The bottom dotted line represents a random policy. Each column represents a different temperature applied to the teacher policy. A temperature of $0$ gives the original deterministic greedy policy from Q-Learning. Bottom row: expected learning curves for same-environment distillation experiments when the teacher is perfect, where trusting the teacher everywhere is optimal.
  • Figure 5: Results on Breakout. We perform distillation of a sub-optimal teacher policy as an auxiliary task.
  • ...and 7 more figures

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • proof