Cross-Task Affinity Learning for Multitask Dense Scene Predictions

Dimitrios Sinodinos, Narges Armanfard

TL;DR

This paper introduces the Cross-Task Affinity Learning (CTAL) module, a lightweight framework that enhances task refinement in multitask networks by optimizing task affinity matrices for parameter-efficient grouped convolutions without concern for information loss.

Abstract

Multitask learning (MTL) has become prominent for its ability to predict multiple tasks jointly, achieving better per-task performance with fewer parameters than single-task learning. Recently, decoder-focused architectures have significantly improved multitask performance by refining task predictions using features from related tasks. However, most refinement methods struggle to efficiently capture both local and long-range dependencies between task-specific representations and cross-task patterns. In this paper, we introduce the Cross-Task Affinity Learning (CTAL) module, a lightweight framework that enhances task refinement in multitask networks. CTAL effectively captures local and long-range cross-task interactions by optimizing task affinity matrices for parameter-efficient grouped convolutions without concern for information loss. Our results demonstrate state-of-the-art MTL performance for both CNN and transformer backbones, using significantly fewer parameters than single-task learning. Our code is publicly available at https://github.com/Armanfard-Lab/EMA-Net.

Paper Structure

This paper contains 27 sections, 13 equations, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: A network diagram of the task-prediction distillation framework, which uses deep supervision at multiple feature scales and applies the CTAL module after cross-scale fusion for task refinement. An input image is passed through a shared encoder to generate features at four different scales relative to the input. We compute the initial predictions from each feature scale, then upsample all task-specific feature maps to the highest scale and combine them in the cross-scale fusion blocks. Finally, the output of each task-specific cross-scale fusion is passed to the CTAL module, where the features are refined and then processed by task-specific decoders to obtain the final predictions.
  • Figure 2: A diagram of the Cross-Task Affinity Learning (CTAL) module, which comprises three stages: Intra-Task, Inter-Task, and Task-Specific Diffusion. We compute the Gram matrix of the flattened and normalized views of the initial task prediction features $\bm{F}^i_{t_k}$ to obtain the task-specific affinity matrices $\bm{A}^i_{t_k}$. We then reshape $\bm{A}^i_{t_k}$ to the original spatial dimensions and perform an interleaved concatenation of all $HW$ channels for each task to obtain the joint affinity matrix $\bm{M}$. Each of the $HW$ sets of $N$ channels is processed by a task-specific grouped convolution ($\text{G Conv}_{t_k}$), and the result diffuses its information to a projected view of $\bm{F}^i_{t_k}$ via matrix multiplication and an element-wise weighted sum to obtain the final refined features $\bm{F}^r_{t_k}$.
  • Figure 3: An illustration of the interleaved concatenation procedure used to align the channels for grouped convolutions in a two-task scenario.
  • Figure 4: A visual comparison of the predictions from the single-task baseline (STL) and CTAL$_{MS}$ (Ours). The two images and the ground truths (GT) are from the validation set of NYUv2.
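
The first two stages described in the Figure 2 and Figure 3 captions can be sketched in NumPy as follows. This is an illustrative approximation under stated assumptions, not the authors' implementation (see the linked repository for that). The function names are hypothetical, the grouped convolution is assumed to be pointwise (1×1) with one output channel per group, and the Task-Specific Diffusion stage is omitted.

```python
import numpy as np


def affinity_matrix(feat):
    """Gram matrix of L2-normalized per-pixel features: (C, H, W) -> (HW, HW)."""
    c, h, w = feat.shape
    pixels = feat.reshape(c, h * w).T                      # (HW, C), one row per pixel
    norms = np.linalg.norm(pixels, axis=1, keepdims=True)
    pixels = pixels / np.maximum(norms, 1e-12)             # guard against zero vectors
    return pixels @ pixels.T                               # cosine similarities


def interleave_affinities(affinities, h, w):
    """Interleave N affinity matrices so that channels [j*N, (j+1)*N) hold
    affinity channel j from every task: N x (HW, HW) -> (HW * N, H, W)."""
    maps = [a.reshape(h * w, h, w) for a in affinities]    # each (HW, H, W)
    stacked = np.stack(maps, axis=1)                       # (HW, N, H, W)
    return stacked.reshape(-1, h, w)


def grouped_pointwise_conv(joint, weights):
    """Assumed 1x1 grouped convolution with HW groups of N input channels and
    one output channel per group. weights: (HW, N) -> output (HW, H, W)."""
    hw, n = weights.shape
    groups = joint.reshape(hw, n, *joint.shape[1:])        # (HW, N, H, W)
    return np.einsum("gn,gnhw->ghw", weights, groups)


rng = np.random.default_rng(0)
n_tasks, c, h, w = 2, 8, 3, 4
feats = [rng.standard_normal((c, h, w)) for _ in range(n_tasks)]
affs = [affinity_matrix(f) for f in feats]                 # Intra-Task stage
joint = interleave_affinities(affs, h, w)                  # joint affinity matrix M
weights = np.full((h * w, n_tasks), 1.0 / n_tasks)         # placeholder for learned weights
mixed = grouped_pointwise_conv(joint, weights)             # Inter-Task stage for one task
print(joint.shape, mixed.shape)                            # (24, 3, 4) (12, 3, 4)
```

Under these assumptions the grouped layer holds $HW \cdot N$ weights per task, whereas a dense pointwise convolution mapping the same $N \cdot HW$ input channels to $HW$ output channels would need $N \cdot (HW)^2$. The interleaving is what makes each group see the same affinity channel from all $N$ tasks.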