Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning

Hanwen Zhong, Jiaxin Chen, Yutong Zhang, Di Huang, Yunhong Wang

TL;DR

This work tackles efficient multi-task learning for Vision Transformers by identifying inefficiencies in existing MoE and LoRA-based methods. It introduces EMTAL, which transforms a pre-trained ViT into a MoEfied LoRA-based multi-task learner, employs Quality Retaining optimization to support asynchronous task convergence, and uses a router fading strategy to reparameterize learned knowledge back into a unified backbone. The MoEfied LoRA component creates a Mixture of Low-Rank Experts by clustering similar weight columns and applying low-rank LoRA updates, while QR preserves high-quality knowledge across tasks. Empirical results on Multi-task FGVC, VTAB-1k, NYUv2, and few-shot settings show EMTAL achieves state-of-the-art accuracy with substantially fewer tunable parameters and no additional inference cost, highlighting its practical impact for scalable, task-rich vision systems.

Abstract

Multi-Task Learning (MTL) for Vision Transformers aims to enhance model capability by tackling multiple tasks simultaneously. Most recent works have focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to perform multi-task learning efficiently. However, their rigid combination hampers both the optimization of MoE and the effectiveness of LoRA reparameterization, leading to sub-optimal performance and low inference speed. In this work, we propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL), which transforms a pre-trained Vision Transformer into an efficient multi-task learner during training and reparameterizes the learned structure for efficient inference. Specifically, we first develop the MoEfied LoRA structure, which decomposes the pre-trained Transformer into a low-rank MoE structure and employs LoRA to fine-tune the parameters. Subsequently, we take into account the intrinsic asynchronous nature of multi-task learning and devise a learning Quality Retaining (QR) optimization mechanism, which leverages historical high-quality class logits to prevent a well-trained task from performance degradation. Finally, we design a router fading strategy to integrate the learned parameters into the original Transformer, achieving efficient inference. Extensive experiments on public benchmarks demonstrate the superiority of our method over state-of-the-art multi-task learning approaches.
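
The reparameterization property that the abstract relies on can be illustrated with plain LoRA. The following is a minimal sketch, not the EMTAL implementation: it only shows that a low-rank update can be folded into a frozen weight so that inference uses a single matrix. The shapes, the scaling of the factors, and the helper names `forward_with_adapter` and `merge` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 32, 4  # illustrative sizes, not taken from the paper

W = rng.standard_normal((d_out, d_in))        # frozen pre-trained weight
B = rng.standard_normal((d_out, rank)) * 0.1  # trainable low-rank factor
A = rng.standard_normal((rank, d_in)) * 0.1   # trainable low-rank factor


def forward_with_adapter(x):
    """Training-time path: frozen weight plus a separate low-rank branch."""
    return x @ W.T + (x @ A.T) @ B.T


def merge(W, B, A):
    """Fold the low-rank update into the weight: W' = W + B A."""
    return W + B @ A


W_merged = merge(W, B, A)

x = rng.standard_normal((5, d_in))
y_adapter = forward_with_adapter(x)
y_merged = x @ W_merged.T  # inference-time path: one matmul, same shape as W

print(np.allclose(y_adapter, y_merged))
```

Because the merged matrix has the same shape as the original weight, the adapted layer costs the same at inference as the pre-trained one. Any routing that depends on the input would prevent this simple merge, which is the kind of obstacle the router fading strategy is described as removing.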

Paper Structure

This paper contains 22 sections, 15 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: FFN as a Mixture of Low-rank Experts (MoLE). Given an up-projection weight matrix in the FFN, a straightforward way of splitting it into an MoE is to assign every $K$ consecutive channels to a separate expert, resulting in highly dissimilar channels within each expert and a high-rank MoE, which is inherently unsuitable for integration with LoRA. In contrast, our proposed MoLE approach rearranges the weight matrix into groups of similar channels as experts, creating specialized low-rank experts that are better suited for integration with LoRA.
  • Figure 2: Summary of representative architectures of multi-task learning.
  • Figure 3: Illustration of the proposed EMTAL framework. Given a pre-trained ViT, we first decompose it into an MoE-based multi-task learner using balanced k-means. LoRA is then applied to the low-rank experts, creating an efficient multi-task learner dubbed MoEfied LoRA. During multi-task optimization, Quality Retaining (QR) is employed to maintain high-quality knowledge for tasks that have already converged. Finally, with the aid of the router fading strategy, the learned knowledge is reparameterized back into the pre-trained ViT, eliminating the extra inference cost.
  • Figure 4: Comparison of the low-rank properties of the vanilla MoE and the proposed MoLE, measured by the Ky Fan 2-k norm [DBLP:journals/jgo/DoanV22]. A higher value signifies a stronger low-rank property.
  • Figure 5: Comparison of results from various separate training approaches in few-shot learning on the multi-task FGVC datasets.
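
The contrast drawn in Figures 1 and 4 can be sketched on synthetic data. The code below is a rough illustration under stated assumptions, not the paper's procedure: the weight matrix is synthetic (channels built from a few prototypes plus noise), `balanced_kmeans` is a simple greedy equal-size variant written for this example, and `topk_energy` is a normalized score inspired by the Ky Fan 2-k norm (the square root of the sum of the k largest squared singular values, as we understand the definition). How the paper normalizes its reported values is not stated here.

```python
import numpy as np


def topk_energy(M, k):
    """Share of squared singular-value mass held by the top-k singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())


def balanced_assign(X, centers):
    """Greedy equal-size assignment: closest (point, center) pairs are served first."""
    n, k = X.shape[0], centers.shape[0]
    cap = n // k  # assumes n is divisible by k
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k)
    labels = np.full(n, -1)
    counts = np.zeros(k, dtype=int)
    for flat in np.argsort(dist, axis=None):
        i, c = divmod(int(flat), k)
        if labels[i] == -1 and counts[c] < cap:
            labels[i] = c
            counts[c] += 1
    return labels


def balanced_kmeans(X, n_clusters, iters=10, seed=0):
    """Hypothetical helper: k-means with the capacity-constrained assignment above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(iters):
        labels = balanced_assign(X, centers)
        centers = np.stack([X[labels == c].mean(axis=0) for c in range(n_clusters)])
    return labels


# Synthetic up-projection weight: each hidden channel is a noisy copy of one of
# a few prototypes, interleaved so that neighbouring channels are dissimilar.
rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 32, 64, 4
prototypes = rng.standard_normal((n_experts, d_model))
group = np.arange(d_hidden) % n_experts
W_up = (prototypes[group] + 0.1 * rng.standard_normal((d_hidden, d_model))).T

size = d_hidden // n_experts

# Vanilla split: every `size` consecutive channels form one expert.
contiguous = [W_up[:, i * size:(i + 1) * size] for i in range(n_experts)]

# Similarity-based split: channels (columns) are clustered into equal-size groups.
labels = balanced_kmeans(W_up.T, n_experts)
clustered = [W_up[:, labels == c] for c in range(n_experts)]

contiguous_score = np.mean([topk_energy(E, 1) for E in contiguous])
clustered_score = np.mean([topk_energy(E, 1) for E in clustered])
print("contiguous experts:", round(contiguous_score, 3))
print("clustered experts: ", round(clustered_score, 3))
```

On this toy matrix the clustered experts concentrate far more of their energy in the leading singular direction than the contiguous ones, which is the qualitative behaviour Figure 4 attributes to MoLE. Real pre-trained FFN weights are not this cleanly structured, so the gap there would need to be measured rather than assumed.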