Mixture of Latent Experts Using Tensor Products

Zhan Su; Fengran Mo; Prayag Tiwari; Benyou Wang; Jian-Yun Nie; Jakob Grue Simonsen

TL;DR

The paper tackles negative transfer in multi-task learning by introducing TensorPoly, a modular language model that uses tensor-product-based adapters (TLoRA) and two routing schemes to enable richer cross-task sharing with high parameter efficiency. It builds on LoRA, Poly, and an entangled-tensor formulation to support higher-order interactions, and presents two routing variants (TensorPoly-I and TensorPoly-II) that differ in granularity. Experiments on the T0 multi-task benchmark show that modular LMs outperform dense baselines, with TensorPoly-I delivering strong performance and high parameter efficiency, while TensorPoly-II provides limited gains. The work highlights the importance of routing over mere parameter addition in latent-expert approaches and points to future directions for domain-specific tensor routing and deeper analysis of granularity effects in modular LMs.

Abstract

In multi-task learning, the conventional approach is to train one model on multiple tasks simultaneously. However, the training signals from different tasks can interfere with one another, potentially leading to negative transfer. To mitigate this, we investigate whether modular language models can facilitate positive transfer and systematic generalization. Specifically, we propose a novel modular language model, TensorPoly, that balances parameter efficiency with nuanced routing methods. For the modules, we reparameterize Low-Rank Adaptation (LoRA) with an entangled tensor built from tensor product operations, and name the resulting approach TLoRA. For the routing function, we tailor two routing functions to different granularities: TensorPoly-I routes to each rank within the entangled tensor, while TensorPoly-II offers finer-grained routing that targets each order of the entangled tensor. Experimental results on the multi-task T0 benchmark demonstrate that: 1) all modular LMs surpass the corresponding dense approaches, highlighting the potential of modular language models to mitigate negative interference in multi-task learning; 2) TensorPoly-I achieves higher parameter efficiency in adaptation and outperforms other modular LMs, which shows the potential of our approach in multi-task transfer learning.
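
As a rough illustration of the entangled-tensor idea behind TLoRA, the sketch below builds one long vector as a sum over ranks of tensor (Kronecker) products of short vectors. It is a minimal sketch and not the authors' implementation: the helper names `kron` and `entangled_vector` are hypothetical, the sizes (rank 3, order 4, factor length 5) are borrowed from the Figure 3 caption, and the optional `rank_weights` argument is only an assumed stand-in for rank-level routing in the spirit of TensorPoly-I. Any normalization or the way routing weights are learned is not covered here.

```python
import random
from functools import reduce


def kron(u, v):
    """Kronecker (tensor) product of two flat vectors, returned as a flat list."""
    return [a * b for a in u for b in v]


def entangled_vector(factors, rank_weights=None):
    """Sum over ranks of the tensor product of N short vectors.

    factors[r][n] is the order-n factor (length p) belonging to rank r.
    rank_weights[r] scales rank r; this is an assumed stand-in for
    rank-level routing, not the paper's exact routing function.
    """
    if rank_weights is None:
        rank_weights = [1.0] * len(factors)
    out = None
    for rank_factors, w in zip(factors, rank_weights):
        term = reduce(kron, rank_factors)  # flat vector of length p ** N
        if out is None:
            out = [w * x for x in term]
        else:
            out = [o + w * x for o, x in zip(out, term)]
    return out


random.seed(0)
R, N, p = 3, 4, 5  # tensor rank, tensor order, factor length (illustrative)
factors = [
    [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(N)]
    for _ in range(R)
]

# All ranks active: one column of length p ** N = 625 from R * N * p = 60 numbers.
column = entangled_vector(factors)

# Keep only the first rank, as a hard rank-level selection might do.
routed = entangled_vector(factors, rank_weights=[1.0, 0.0, 0.0])
```

Under these assumptions, a task-specific weight vector over ranks changes which rank-one terms contribute to the adapter column, without adding any new factor parameters.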

Paper Structure (32 sections, 9 equations, 9 figures, 6 tables)

Introduction
Related Work
Background
Module: LoRA
Polytropon (Poly): Mixture of Latent Experts with Linear Combination
Tensor, Tensor Product, Entangled Tensor
Tensor.
Tensor Product.
Entangled Tensor.
Tensorized Vector using Entangled Tensor
Methods: TensorPoly
TLoRA
TensorPoly: Mixture of Latent Experts using Tensor Products
Experiments
Datasets and Evaluation
...and 17 more sections

Figures (9)

Figure 1: Left: Comparison between the dense models (LoRA, TLoRA) and latent-expert approaches (Poly, MHR, TensorPoly-I, TensorPoly-II). Poly/MHR use LoRA as the modules, TensorPoly-I and TensorPoly-II use TLoRA as the modules. Right: Adaptation parameters across different approaches in the fine-tuning process.
Figure 2: Comparison of three training paradigms in multi-task transfer learning. Left: private training, where an expert is trained individually for each task. Middle: shared training, where a single expert is trained continually on all tasks. Right: the latent-experts model, where a subset of "latent" experts is trained across all tasks, so that each task's expert can be seen as a linear combination of these latent experts.
Figure 3: TensorPoly-I and TensorPoly-II. We illustrate how to reparameterize the LoRA matrix in $\mathbb{R}^{625\times 5}$ with 4 tensors $\mathcal{A}\in\mathbb{R}^{3\times 5\times 5}$. In this case, the tensor rank is $R=3$ and the tensor order is $N=4$. For TensorPoly-I, the routing function $\mathbf{Z}$ selects which rank of the entangled tensor is activated for a given task. TensorPoly-II offers more granular control by selecting both the tensor rank and the tensor order.
Figure 4: TensorPoly-X. We illustrate how to reparameterize the full-rank LoRA matrix $\Delta W \in \mathbb{R}^{625\times 625}$ with 4 tensors in a tensor train format. In this case, the tensor rank $R=3$ corresponds to the number of latent experts at each of a total number of $N=3$ levels. The routing function $\mathbf{Z} \in \mathbb{R}^{|\mathcal{T}|\times N \times R}$ is designed to select the activated rank for a given task at each level.
Figure 5: Rank analysis of TensorPoly-I. Left: average accuracy over 11 held-out tasks for different ranks. Right: validation loss during multi-task pre-training.
...and 4 more figures
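
The shapes quoted in the Figure 3 caption can be checked with a little arithmetic. The snippet below is only a sketch: the function names are hypothetical, and it counts entries of the single LoRA matrix and of the 4 tensors named in the caption, ignoring routing parameters and any other adapter weights.

```python
from math import prod


def lora_matrix_params(d, r):
    """Entries in a dense d x r LoRA matrix."""
    return d * r


def entangled_params(num_tensors, tensor_shape):
    """Entries in num_tensors tensors that all share tensor_shape."""
    return num_tensors * prod(tensor_shape)


# Shapes taken from the Figure 3 caption.
dense = lora_matrix_params(625, 5)         # the 625 x 5 LoRA matrix
factored = entangled_params(4, (3, 5, 5))  # 4 tensors of shape 3 x 5 x 5

# The order-4 factorization relies on 625 = 5 ** 4.
assert 5 ** 4 == 625

print(dense, factored)
```

Under this reading of the caption, the dense matrix holds 3125 entries while the 4 tensors together hold 300.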
