Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning

Juan Garcia Giraldo, Nikolaos Dimitriadis, Ke Wang, Pascal Frossard

TL;DR

This paper tackles single-input, multiple-output (SIMO) multi-task learning for dense prediction by investigating model merging of single-task checkpoints. It reveals that naive merging methods like Task Arithmetic or TIES fail in SIMO due to encoder–decoder representation misalignment, a challenge amplified by diverse losses across dense tasks. The authors propose two lightweight fixes—head realignment and representation realignment using parameter-efficient fine-tuning with LoRA adapters—to re-align the merged encoder with task-specific heads. Across NYUv2, Cityscapes, and Taskonomy, the approach yields competitive or superior multi-task performance with substantially reduced training data and compute compared to joint fine-tuning. In addition, the paper introduces a method to analyze task relationships via task vectors, enabling offline assessment of task compatibility and representation sensitivity to improve future SIMO deployments.

Abstract

Model merging is a flexible and computationally tractable approach to merge single-task checkpoints into a multi-task model. Prior work has focused solely on constrained multi-task settings where there is a one-to-one mapping between a sample and a task, overlooking the paradigm where multiple tasks may operate on the same sample, e.g., scene understanding. In this paper, we focus on the multi-task setting with single-input-multiple-outputs (SIMO) and show that it qualitatively differs from the single-input-single-output model merging settings studied in the literature due to the existence of task-specific decoders and diverse loss objectives. We identify that existing model merging methods lead to significant performance degradation, primarily due to representation misalignment between the merged encoder and task-specific decoders. We propose two simple and efficient fixes for the SIMO setting to re-align the feature representation after merging. Compared to joint fine-tuning, our approach is computationally efficient and flexible, and sheds light on task relationships in an offline manner. Experiments on NYUv2, Cityscapes, and a subset of the Taskonomy dataset demonstrate: (1) task arithmetic suffices to enable multi-task capabilities; however, the representations generated by the merged encoder have to be re-aligned with the task-specific heads; (2) the proposed architecture rivals traditional multi-task learning in performance but requires fewer samples and training steps by leveraging the existence of task-specific models.
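
As a rough illustration of the task arithmetic baseline mentioned in the abstract, the sketch below merges two checkpoints by adding scaled task vectors (fine-tuned weights minus the shared starting weights) to the starting weights. It is a minimal sketch, not the authors' implementation: the flat-list parameter format, the names `task_vector` and `task_arithmetic`, the parameter name `encoder.w`, and all numeric values are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed toy format: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def task_vector(pretrained: Params, finetuned: Params) -> Params:
    """tau_t = theta_t - theta_0, computed per parameter."""
    return {
        name: [ft - pt for ft, pt in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def task_arithmetic(pretrained: Params, finetuned_models: List[Params], lam: float) -> Params:
    """theta_merged = theta_0 + lam * sum_t tau_t."""
    vectors = [task_vector(pretrained, ft) for ft in finetuned_models]
    return {
        name: [
            base + lam * sum(vec[name][i] for vec in vectors)
            for i, base in enumerate(weights)
        ]
        for name, weights in pretrained.items()
    }


# Toy shared encoder with a single two-weight "layer"; values are made up.
theta_0 = {"encoder.w": [1.0, 2.0]}
theta_seg = {"encoder.w": [2.0, 2.0]}    # hypothetical segmentation checkpoint
theta_depth = {"encoder.w": [1.0, 4.0]}  # hypothetical depth checkpoint

merged = task_arithmetic(theta_0, [theta_seg, theta_depth], lam=0.5)
print(merged)  # {'encoder.w': [1.5, 3.0]}
```

In the SIMO setting only the shared encoder would be merged like this, while each task keeps its own decoder. That is where the misalignment described above can arise: each head was trained against its own fine-tuned encoder, not against the merged one.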

Paper Structure

This paper contains 32 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison of performance deterioration between dense prediction (Taskonomy taskonomy2018 with 5 tasks) and vision classification (8-task benchmark introduced in ilharco2023task) benchmarks as a function of the number of tasks. Each point corresponds to the normalized performance of the Task Arithmetic ilharco2023task baseline for a $k$-task combination. Dense prediction combinations exhibit a steeper decrease than vision classification, indicating the increased difficulty of the setting.
  • Figure 2: Optimal merging coefficients per layer as found by AdaMerging yang2023adamerging on NYUv2 for the DINOv2 architecture, which consists of 12 repeating blocks of layers. Different tasks deem different parts of the representation important; therefore, applying computationally tractable but simple merging approaches like Task Arithmetic results in representation misalignment in the encoder.
  • Figure 3: Visualization of task relationships for Cityscapes. Each plot presents the normalized performance on each dataset as a function of the coefficient $\lambda$ (x-axis) with which a task vector $\bm{\tau}$ is added to the model, forming the model $\bm{\theta}_{0}+\lambda \bm{\tau}$, for $\bm{\tau}\in\{\bm{\tau}_{seg}, \bm{\tau}_{part\_seg},\bm{\tau}_{disp}\}$.
  • Figure 4: Qualitative results comparing MTL, Task Arithmetic, and our Representation Re-alignment technique on the Taskonomy dataset.
  • Figure 5: Visualization of task relationships for NYUv2. Each plot presents the normalized performance on each dataset as a function of the coefficient $\lambda$ (x-axis) with which a task vector $\bm{\tau}$ is added to the model, forming the model $\bm{\theta}_{0}+\lambda \bm{\tau}$, for $\bm{\tau}\in\{\bm{\tau}_{seg}, \bm{\tau}_{depth},\bm{\tau}_{normals}\}$.
  • ...and 1 more figures
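
The task-relationship plots in Figures 3 and 5 sweep a coefficient $\lambda$ over models of the form $\bm{\theta}_{0}+\lambda \bm{\tau}$. The sketch below shows how such a sweep could be organized. It is an assumption-laden illustration, not the paper's code: `add_scaled`, `sweep`, and `toy_metric` are hypothetical names, and the toy metric only stands in for running a real model on a validation set.

```python
from typing import Callable, Dict, List, Tuple

# Assumed toy format: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def add_scaled(theta_0: Params, tau: Params, lam: float) -> Params:
    """Return theta_0 + lam * tau, computed per parameter."""
    return {
        name: [w + lam * t for w, t in zip(theta_0[name], tau[name])]
        for name in theta_0
    }


def sweep(
    theta_0: Params,
    tau: Params,
    lambdas: List[float],
    evaluate: Callable[[Params], float],
) -> List[Tuple[float, float]]:
    """Score the model theta_0 + lam * tau for every coefficient in lambdas."""
    return [(lam, evaluate(add_scaled(theta_0, tau, lam))) for lam in lambdas]


def toy_metric(params: Params) -> float:
    """Placeholder for a real validation metric; it just sums the weights."""
    return sum(params["encoder.w"])


theta_0 = {"encoder.w": [1.0, 2.0]}   # made-up starting weights
tau_seg = {"encoder.w": [1.0, 0.0]}   # made-up task vector

curve = sweep(theta_0, tau_seg, [0.0, 0.5, 1.0], toy_metric)
print(curve)  # [(0.0, 3.0), (0.5, 3.5), (1.0, 4.0)]
```

With a real evaluator per task, repeating the sweep for each task vector would give one curve per (task vector, evaluated task) pair, which is the kind of grid the figure captions describe.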