ReLope: KL-Regularized LoRA Probes for Multimodal LLM Routing

Yaopei Zeng, Congchao Wang, Blake JianHang Chen, Lu Lin

Abstract

Routing has emerged as a promising strategy for balancing performance and cost in large language model (LLM) systems that combine lightweight models with powerful but expensive large models. Recent studies show that \emph{probe routing}, which predicts the correctness of a small model using its hidden states, provides an effective solution in text-only LLMs. However, we observe that these probes degrade substantially when applied to multimodal LLMs (MLLMs). Through empirical analysis, we find that the presence of visual inputs weakens the separability of correctness signals in hidden states, making them harder to extract using standard probe designs. To address this challenge, we introduce two complementary approaches for improving probe routing in MLLMs. First, we propose the \emph{Attention Probe}, which aggregates hidden states from the preceding layer based on attention scores to recover distributed correctness signals. Second, we present the \emph{KL-Regularized LoRA Probe (ReLope)}, which inserts a lightweight LoRA adapter and applies a KL regularizer to learn routing-aware representations. Comprehensive experiments show that our methods consistently outperform baselines, suggesting that improving the quality of hidden states is key to effective routing in MLLMs. Our code is available at https://github.com/Spinozaaa/ReLope.
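The probe-routing idea described above can be sketched in a few lines: a linear probe maps the small model's last-token hidden state to a predicted probability of correctness, and queries whose predicted correctness falls below a threshold are escalated to the large model. This is a minimal illustrative sketch, not the paper's implementation; the dimensions, weights `W`, `b`, and threshold `tau` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden dimension d, a batch of n queries.
d, n = 16, 8
W = rng.normal(size=d) * 0.1     # illustrative probe weights (a real probe is trained)
b = 0.0
H = rng.normal(size=(n, d))      # last-token hidden states from the small model

def probe_route(H, W, b, tau=0.5):
    """Score each query with a sigmoid linear probe and flag queries
    whose predicted correctness is below tau for the large model."""
    p = 1.0 / (1.0 + np.exp(-(H @ W + b)))   # P(small model answers correctly)
    to_large = p < tau                        # low-confidence queries are escalated
    return p, to_large

p, to_large = probe_route(H, W, b)
```

Raising `tau` routes more traffic to the large model (higher accuracy, higher cost), which is exactly the routing-ratio trade-off the paper evaluates.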


Figures (3)

  • Figure 1: ReLope pipeline. Given an input sample $(\mathbf{x}, y)$, the MLLM $f_{\theta}$ produces hidden states $\mathbf{Z}^{(l-1)}$ at layer $l-1$. A lightweight LoRA adapter $\theta_{\mathrm{LoRA}}$ is inserted into the model to obtain adapted hidden states $\tilde{\mathbf{Z}}^{(l)}$. The adapted last-token representation $\mathbf{z} = \tilde{\mathbf{z}}^{(l)}_{n}$ is then used as the routing feature. It is fed into the probe $g_{\phi}$ to predict correctness $\hat{y}$ for routing. In parallel, two linear heads map $\mathbf{z}$ to $(\boldsymbol{\mu}, \log \boldsymbol{\sigma}^2)$, which parameterize a Gaussian posterior for a variational bottleneck. The training objective combines cross-entropy loss with a KL regularizer, encouraging the learned representation to preserve correctness-related information while filtering out task-irrelevant variation.
  • Figure 2: Overall system accuracy as a function of routing ratio. The routing ratio denotes the percentage of queries routed to the large model in a hybrid MLLM system. Higher curves indicate better routing strategies that achieve stronger trade-offs between accuracy and cost by sending only the most difficult queries to the large model.
  • Figure 3: Ablation studies of ReLope. (a) Effect of the LoRA rank $r$. (b) Effect of the transformer layer $l$ used for probe placement. (c) Effect of the VIB coefficient $\beta$. Results are reported as AUC on five multimodal benchmarks.
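The training objective in the Figure 1 caption, cross-entropy on the probe's correctness prediction plus a KL regularizer on the Gaussian posterior $(\boldsymbol{\mu}, \log \boldsymbol{\sigma}^2)$, can be sketched as below. This is a hedged numpy illustration under assumed shapes, using the standard closed-form KL between $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ and $\mathcal{N}(\mathbf{0}, \mathbf{I})$; the head weights `Wp`, `Wmu`, `Wlv` and the coefficient `beta` are hypothetical, not the paper's trained parameters.

```python
import numpy as np

def relope_loss(z, y, Wp, Wmu, Wlv, beta=1e-3):
    """Sketch of a ReLope-style objective: binary cross-entropy on the
    probe's correctness prediction plus beta * KL(N(mu, sigma^2) || N(0, I)).
    z: adapted last-token representations, y: 0/1 correctness labels."""
    mu = z @ Wmu                     # posterior mean head
    logvar = z @ Wlv                 # posterior log-variance head
    # Closed-form KL to a standard normal, averaged over the batch.
    kl = 0.5 * np.mean(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1))
    p = 1.0 / (1.0 + np.exp(-(z @ Wp)))          # probe correctness probability
    ce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return ce + beta * kl

rng = np.random.default_rng(0)
d, n, k = 16, 8, 4                   # hidden dim, batch size, bottleneck dim (assumed)
z = rng.normal(size=(n, d))          # stands in for the LoRA-adapted representations
y = rng.integers(0, 2, size=n).astype(float)
Wp = rng.normal(size=d) * 0.1
Wmu = rng.normal(size=(d, k)) * 0.1
Wlv = rng.normal(size=(d, k)) * 0.1
loss = relope_loss(z, y, Wp, Wmu, Wlv)
```

The KL term is zero exactly when the posterior matches the standard normal, so increasing `beta` squeezes task-irrelevant variation out of the representation while the cross-entropy term preserves the correctness signal.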