Mixture of Tunable Experts -- Behavior Modification of DeepSeek-R1 at Inference Time

Robert Dahlke, Henrik Klagges, Dan Zecha, Benjamin Merkel, Sven Rohr, Fabian Klemm

TL;DR

This work addresses how to steer large language models at inference time without retraining by introducing Mixture of Tunable Experts (MoTE), an extension of MoE that allows overriding router decisions. By coupling MoTE with Functional Token Resonance Imaging (fTRI), the authors localize alignment-related behavior to small subsets of routed experts in DeepSeek-R1 and selectively suppress or stimulate them to modify outputs such as refusals or the language used for reasoning. Key findings show that suppressing a tiny fraction of distinctive experts drastically reduces refusals (up to 52% on a broader sensitive-topic dataset) without harming general performance, while stimulation can modulate behavior in predictable directions. The results imply improved interpretability and controllability of MoE-based models and suggest that critical behavioral mechanisms may be localized rather than distributed across weights, enabling targeted safety and behavior adjustments in deployment settings.

Abstract

We present the Mixture-of-Tunable-Experts (MoTE), a method that extends the Mixture-of-Experts architecture of Large Language Models (LLMs). Without additional training, MoTE enables meaningful and focused behavior changes in LLMs on-the-fly during inference time. By analyzing the digital LLM brain of DeepSeek-R1 using a technique we dub 'functional Token Resonance Imaging' (fTRI) -- inspired by fMRI and using prompts designed to elicit specific behavior (e.g., 'What happened {time}{place}?') -- we empirically identify distinctive experts associated with behaviors like refusal responses. Using MoTE we are able to intervene and control such specific behavior. We switched off the top 10 most refusal-relevant experts (0.07% of R1's 14,848 routed experts), achieving a 52% refusal reduction on sensitive reference prompts without performance degradation on MT-Bench. Random expert deactivation resulted in smaller behavioral shifts with increased noise, whereas forced expert activation led to significantly higher refusal rates. Our approach shares similarities with sparse autoencoders (SAEs) in terms of explainability and steerability. Unlike SAEs, MoTE does not require large training efforts, as within MoEs with a vast number of experts, specialization already emerged naturally during pretraining. Our findings suggest that significant functional mechanisms in Mixture-of-Experts architectures can at least partially be localized in a small number of specific experts, rather than being distributed throughout the model's weights. Expert subgroups can be tuned to trigger significant behavior variations, providing insights into the inner workings of LLMs.
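
The router override described in the abstract can be illustrated with a minimal sketch. It assumes a single token in a single gated layer, with router scores given as a plain list. The function name `select_experts` and the parameters `suppress` and `stimulate` are hypothetical and not taken from the paper, and the sketch covers only which experts are selected, not how their gate weights are computed.

```python
import math


def select_experts(router_scores, k=8, suppress=(), stimulate=()):
    """Return the ids of the k routed experts chosen for one token.

    router_scores -- one affinity score per routed expert (hypothetical input)
    suppress      -- expert ids that are never selected ("switched off")
    stimulate     -- expert ids that are always selected ("forced on")
    """
    suppress, stimulate = set(suppress), set(stimulate)
    if suppress & stimulate:
        raise ValueError("an expert cannot be both suppressed and stimulated")

    # Override the router's output: -inf removes an expert from the top-k,
    # +inf guarantees it a place (as long as at most k experts are stimulated).
    adjusted = []
    for expert_id, score in enumerate(router_scores):
        if expert_id in suppress:
            adjusted.append(-math.inf)
        elif expert_id in stimulate:
            adjusted.append(math.inf)
        else:
            adjusted.append(score)

    ranked = sorted(range(len(adjusted)), key=lambda e: adjusted[e], reverse=True)
    # Drop suppressed experts even if fewer than k candidates remain.
    chosen = [e for e in ranked[:k] if adjusted[e] != -math.inf]
    return sorted(chosen)


scores = [0.1, 0.9, 0.5, 0.7]  # toy scores for four experts
print(select_experts(scores, k=2))                 # [1, 3]
print(select_experts(scores, k=2, suppress=[1]))   # [2, 3]
print(select_experts(scores, k=2, stimulate=[0]))  # [0, 1]

# Share of routed experts switched off in the reported experiment.
print(f"{10 / 14848:.2%}")                         # 0.07%
```

With no overrides the function reduces to ordinary top-k routing, so the intervention stays local to the listed experts and leaves every other routing decision untouched.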

Paper Structure

This paper contains 15 sections, 14 figures, 1 table.

Figures (14)

  • Figure 1: MoTE as an extension to the DeepSeekMoE architecture (illustration taken from the DeepSeek-V3 technical report [deepseekai2024deepseekv3] and modified). Shared experts are always activated. Originally, only the top-k routed experts selected by the Router's output get activated. With MoTE, there is added flexibility: individual routed experts can be tuned by manually overriding the Router's output at inference time.
  • Figure 2: Functional Token Resonance Imaging (fTRI) visualizes the expert activation of an exemplary single token. For each gated layer in the network, the corresponding row depicts which 8 of the 256 routed experts got activated.
  • Figure 3: fTRI map that visualizes the expert activation pattern aggregated over all tokens in an input prompt. Each row depicts the activations of the layer experts as in Figure 2, summed over all prompt tokens. A higher signal indicates that a given expert was selected for more input tokens.
  • Figure 4: Mapping expert activation patterns for different prompts into two dimensions using t-SNE reveals distinct clusters of similar activation patterns. For example, cluster 2 represents questions about recent events (see text for details on clusters C1 to C4).
  • Figure 5: fTRI map of differential activations for refused prompts. Highly distinctive experts (highlighted with red circles, e.g. the routed expert with expert-id 176, layer-id 48) are chosen as candidates for tuning; their exact coordinates are listed in the appendix.
  • ...and 9 more figures
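
The captions for Figures 2, 3 and 5 describe a pipeline that can be sketched roughly as follows: record which experts each token selects, sum those selections into a layer-by-expert map, and compare maps between refused and answered prompts. The function names are hypothetical, and taking the differential as a plain difference of mean maps is an assumption; the captions do not state the paper's exact formula. The toy data uses 2 layers with 4 experts and one selected expert per token, instead of 8 of 256.

```python
def ftri_map(token_selections, n_layers, n_experts):
    """Sum expert selections over all tokens of one prompt (as in Figure 3).

    token_selections -- one entry per token; each entry holds, per gated
                        layer, the list of expert ids selected for that token
    """
    counts = [[0] * n_experts for _ in range(n_layers)]
    for per_layer in token_selections:
        for layer, expert_ids in enumerate(per_layer):
            for expert_id in expert_ids:
                counts[layer][expert_id] += 1
    return counts


def mean_map(maps):
    """Element-wise mean over several prompt-level maps of equal shape."""
    n_layers, n_experts = len(maps[0]), len(maps[0][0])
    return [[sum(m[l][e] for m in maps) / len(maps) for e in range(n_experts)]
            for l in range(n_layers)]


def differential_map(refused_maps, answered_maps):
    """Assumed differential: mean refused map minus mean answered map."""
    refused, answered = mean_map(refused_maps), mean_map(answered_maps)
    return [[r - a for r, a in zip(r_row, a_row)]
            for r_row, a_row in zip(refused, answered)]


def top_distinctive(diff, n):
    """Return the n (layer, expert, value) cells with the largest differential."""
    cells = [(l, e, v) for l, row in enumerate(diff) for e, v in enumerate(row)]
    return sorted(cells, key=lambda c: c[2], reverse=True)[:n]


# Toy example: two tokens per prompt, 2 layers, 4 experts, 1 expert per token.
refused = ftri_map([[[0], [3]], [[0], [3]]], n_layers=2, n_experts=4)
answered = ftri_map([[[1], [3]], [[0], [2]]], n_layers=2, n_experts=4)
diff = differential_map([refused], [answered])
print(top_distinctive(diff, 2))  # [(0, 0, 1.0), (1, 3, 1.0)]
```

The (layer, expert) pairs returned by `top_distinctive` correspond to the kind of coordinates circled in Figure 5, which are then the candidates for suppression or stimulation.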