Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning
Juan Garcia Giraldo, Nikolaos Dimitriadis, Ke Wang, Pascal Frossard
TL;DR
This paper tackles single-input, multiple-output (SIMO) multi-task learning for dense prediction by investigating model merging of single-task checkpoints. It reveals that naive merging methods like Task Arithmetic or TIES fail in the SIMO setting due to encoder–decoder representation misalignment, a challenge amplified by diverse losses across dense tasks. The authors propose two lightweight fixes, head realignment and representation realignment using parameter-efficient fine-tuning with LoRA adapters, to re-align the merged encoder with task-specific heads. Across NYUv2, Cityscapes, and Taskonomy, the approach yields competitive or superior multi-task performance with substantially less training data and compute than joint fine-tuning. In addition, the paper introduces a method to analyze task relationships via task vectors, enabling offline assessment of task compatibility and representation sensitivity to improve future SIMO deployments.
Abstract
Model merging is a flexible and computationally tractable approach to merge single-task checkpoints into a multi-task model. Prior work has focused solely on constrained multi-task settings where there is a one-to-one mapping between a sample and a task, overlooking the paradigm where multiple tasks may operate on the same sample, e.g., scene understanding. In this paper, we focus on the multi-task setting with single-input-multiple-outputs (SIMO) and show that it qualitatively differs from the single-input-single-output model merging settings studied in the literature due to the existence of task-specific decoders and diverse loss objectives. We identify that existing model merging methods lead to significant performance degradation, primarily due to representation misalignment between the merged encoder and task-specific decoders. We propose two simple and efficient fixes for the SIMO setting to re-align the feature representation after merging. Compared to joint fine-tuning, our approach is computationally efficient and flexible, and sheds light on identifying task relationships in an offline manner. Experiments on NYUv2, Cityscapes, and a subset of the Taskonomy dataset demonstrate: (1) task arithmetic suffices to enable multi-task capabilities; however, the representations generated by the merged encoder have to be re-aligned with the task-specific heads; (2) the proposed architecture rivals traditional multi-task learning in performance but requires fewer samples and training steps by leveraging the existence of task-specific models.
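To make the merging step concrete, the following is a minimal sketch of task arithmetic applied to a shared encoder, under the standard formulation in which each task vector is the difference between fine-tuned and pretrained weights and the merged encoder is the pretrained weights plus a scaled sum of task vectors. It is an illustration, not the authors' implementation: the function names, the parameter name `encoder.w`, the scaling coefficient `alpha`, and the toy weight values are all assumptions, and parameters are represented as flat Python lists rather than framework tensors.

```python
def task_vector(pretrained, finetuned):
    """Per-parameter difference: fine-tuned weights minus pretrained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge_task_arithmetic(pretrained, finetuned_list, alpha=1.0):
    """Add the alpha-scaled task vector of every checkpoint to the pretrained weights."""
    merged = {name: list(values) for name, values in pretrained.items()}
    for finetuned in finetuned_list:
        tv = task_vector(pretrained, finetuned)
        for name in merged:
            merged[name] = [m + alpha * d for m, d in zip(merged[name], tv[name])]
    return merged


# Toy encoder weights (hypothetical values, one flattened parameter).
pretrained = {"encoder.w": [1.0, 2.0, 3.0]}
depth_ckpt = {"encoder.w": [1.5, 2.0, 2.0]}  # encoder fine-tuned on one task
seg_ckpt = {"encoder.w": [1.0, 3.0, 3.5]}    # encoder fine-tuned on another task

merged = merge_task_arithmetic(pretrained, [depth_ckpt, seg_ckpt], alpha=0.5)
print(merged["encoder.w"])
```

Only the encoder is merged in this sketch; the task-specific heads would be kept from their single-task checkpoints, which is where the misalignment described above arises and why a realignment step follows merging.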
