Mixture of Latent Experts Using Tensor Products
Zhan Su, Fengran Mo, Prayag Tiwari, Benyou Wang, Jian-Yun Nie, Jakob Grue Simonsen
TL;DR
The paper tackles negative transfer in multi-task learning by introducing TensorPoly, a modular language model that combines tensor-product-based adapters (TLoRA) with two routing schemes to enable cross-task sharing at high parameter efficiency. It builds on LoRA and Poly, adding an entangled-tensor formulation that supports higher-order interactions, and presents two routing variants of differing granularity: TensorPoly-I, which routes over the ranks of the entangled tensor, and TensorPoly-II, which routes at a finer grain over its orders. Experiments on the T0 multi-task benchmark show that modular LMs outperform their dense counterparts; TensorPoly-I outperforms the other modular LMs while using parameters more efficiently, whereas TensorPoly-II provides limited gains. The work highlights that, in latent-expert approaches, routing matters more than simply adding parameters, and it points to future work on domain-specific tensor routing and a deeper analysis of granularity effects in modular LMs.
Abstract
In multi-task learning, the conventional approach is to train a single model on multiple tasks simultaneously. However, the training signals from different tasks can interfere with one another, potentially leading to \textit{negative transfer}. To mitigate this, we investigate whether modular language models can facilitate positive transfer and systematic generalization. Specifically, we propose a novel modular language model (\texttt{TensorPoly}) that balances parameter efficiency with nuanced routing methods. For the \textit{modules}, we reparameterize Low-Rank Adaptation (\texttt{LoRA}) as an entangled tensor built with tensor product operations, and we name the resulting approach \texttt{TLoRA}. For the \textit{routing function}, we tailor two routing functions to different granularities: \texttt{TensorPoly-I} routes to each rank within the entangled tensor, while \texttt{TensorPoly-II} offers finer-grained routing that targets each order of the entangled tensor. Experimental results on the multi-task T0 benchmark demonstrate that 1) all modular LMs surpass the corresponding dense approaches, highlighting the potential of modular language models to mitigate negative interference in multi-task learning and deliver superior outcomes; and 2) \texttt{TensorPoly-I} achieves higher parameter efficiency in adaptation and outperforms other modular LMs, which shows the potential of our approach in multi-task transfer learning.
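
To make the idea of an entangled tensor with rank-level routing concrete, the following is a minimal sketch and not the authors' implementation. It assumes a CP-style form in which a long vector is written as a weighted sum over ranks of Kronecker (tensor) products of short factor vectors, with one routing weight per rank in the spirit of \texttt{TensorPoly-I}. The names `kron`, `entangled_vector`, `factors`, and `rank_weights` are hypothetical, and the exact parameterization in the paper may differ.

```python
from functools import reduce
from typing import List

Vector = List[float]


def kron(u: Vector, v: Vector) -> Vector:
    """Kronecker (tensor) product of two vectors, flattened to one vector."""
    return [a * b for a in u for b in v]


def entangled_vector(factors: List[List[Vector]], rank_weights: Vector) -> Vector:
    """Weighted sum over ranks of tensor products of per-order factors.

    factors[r][n] is the short factor vector for rank r and order n
    (hypothetical layout, chosen for illustration).
    rank_weights[r] stands in for a per-task routing weight on rank r
    (an assumption about what rank-level routing looks like).
    """
    out: Vector = []
    for weight, rank_factors in zip(rank_weights, factors):
        # Tensor product across all orders for this rank.
        term = reduce(kron, rank_factors)
        scaled = [weight * x for x in term]
        # Accumulate the rank-wise terms.
        out = scaled if not out else [a + b for a, b in zip(out, scaled)]
    return out


# Two ranks, two orders, factor length 2: the result has 2 * 2 = 4 entries.
factors = [
    [[1.0, 2.0], [3.0, 4.0]],  # rank 0
    [[1.0, 0.0], [0.0, 1.0]],  # rank 1
]
rank_weights = [1.0, 0.5]  # hypothetical routing weights for one task
vec = entangled_vector(factors, rank_weights)
```

The parameter saving comes from the product structure: with factor length 8 and order 3, each rank produces a vector of 8^3 = 512 entries from only 3 * 8 = 24 stored parameters. Routing at the finer granularity of each order, as \texttt{TensorPoly-II} is described as doing, would attach weights to individual factors rather than to whole ranks; that variant is not shown here.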
