Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free

Ziyue Li, Tianyi Zhou

TL;DR

This study shows that the expert routers in MoE LLMs can serve as an off-the-shelf embedding model with promising performance on a diverse class of embedding-focused tasks, without requiring any finetuning.

Abstract

While large language models (LLMs) excel on generation tasks, their decoder-only architecture often limits their potential as embedding models if no further representation finetuning is applied. Does this contradict their claim to be generalists? To answer the question, we take a closer look at Mixture-of-Experts (MoE) LLMs. Our study shows that the expert routers in MoE LLMs can serve as an off-the-shelf embedding model with promising performance on a diverse class of embedding-focused tasks, without requiring any finetuning. Moreover, our extensive analysis shows that the MoE routing weights (RW) are complementary to the hidden state (HS) of LLMs, a widely used embedding. Compared to HS, we find that RW is more robust to the choice of prompts and focuses on high-level semantics. Motivated by this analysis, we propose MoEE, which combines RW and HS and achieves better performance than using either separately. Our exploration of their combination and of prompting strategies yields several novel insights, e.g., a weighted sum of RW and HS similarities outperforms the similarity computed on their concatenation. Our experiments are conducted on 6 embedding tasks with 20 datasets from the Massive Text Embedding Benchmark (MTEB). The results demonstrate the significant improvement brought by MoEE to LLM-based embedding without further finetuning.
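
The two ways of combining RW and HS mentioned in the abstract can be contrasted with a minimal sketch. It assumes cosine similarity as the base similarity and a single scalar weight, here called `alpha`, on the RW term. The function names, the weight, and the toy vectors are illustrative assumptions, not the paper's implementation or data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concat_similarity(hs_a, rw_a, hs_b, rw_b):
    """Similarity computed on the concatenation [HS; RW] of each input."""
    return cosine(hs_a + rw_a, hs_b + rw_b)


def weighted_sum_similarity(hs_a, rw_a, hs_b, rw_b, alpha=1.0):
    """Similarity computed separately on HS and RW, then summed.

    `alpha` is a hypothetical weight on the RW term (an assumption).
    """
    return cosine(hs_a, hs_b) + alpha * cosine(rw_a, rw_b)


# Toy vectors (made up): the HS vectors are orthogonal and have a large norm,
# while the RW vectors are identical and have a small norm.
hs_a, hs_b = [10.0, 0.0], [0.0, 10.0]
rw_a, rw_b = [0.6, 0.4], [0.6, 0.4]

sim_concat = concat_similarity(hs_a, rw_a, hs_b, rw_b)
sim_sum = weighted_sum_similarity(hs_a, rw_a, hs_b, rw_b, alpha=1.0)
```

In this toy case the concatenated similarity is close to zero because the larger-norm HS part dominates the dot product and the norms, whereas the weighted sum still registers the perfect RW agreement. This only illustrates how the two formulas can diverge; it is not evidence for the reported results.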

Paper Structure

This paper contains 14 sections, 7 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Comparison of hidden state (HS) and MoEE (ours) on six types of tasks from the Massive Text Embedding Benchmark (MTEB), where MoEE consistently outperforms HS on all tasks. Both HS and MoEE are extracted from DeepSeekMoE-16B (Dai et al., 2024) without further finetuning.
  • Figure 1: Correlation of the clustering results obtained with the routing weight (RW) and hidden state (HS) embeddings extracted from MoE LLMs. Low scores indicate the complementarity of RW and HS.
  • Figure 2: Complementarity of DeepSeekMoE-16B's routing weights (RW) and hidden state (HS) as embeddings in the similarity-ranking task on the STS12 dataset. In the error analysis of instances where at least one embedding fails, we report the proportion of cases where (1) HS succeeds ✓ and RW fails ✗; (2) HS fails and RW succeeds; and (3) both RW and HS fail. In most cases, the proportion of (1)+(2) exceeds (3), indicating the complementarity of RW and HS.
  • Figure 3: Word clouds of the top-20 topics from 3 clusters obtained with RW and HS separately, highlighting the distinct semantic features each captures.
  • Figure 4: Heatmap of Spearman's rank correlation between RW and HS embeddings obtained using nine different prompts (defined in Table \ref{tab:clustering_evaluation}). The top-left (HS-HS) and bottom-right (RW-RW) blocks show the correlations between embeddings of the same type under different prompts, with mean scores of 0.52 and 0.63 (excluding the diagonal entries), respectively. This implies that RW is more robust to varying prompts than HS. The top-right and bottom-left blocks show the correlations between RW and HS when using the same or different prompts, both with a mean score of 0.51. This score, the lowest of the three, indicates the complementarity between RW and HS.
  • ...and 1 more figure