Adapter-Augmented Bandits for Online Multi-Constrained Multi-Modal Inference Scheduling

Xianzhi Zhang, Yue Xu, Yinlin Zhu, Di Wu, Yipeng Zhou, Miao Hu, Guocong Quan

TL;DR

This paper proposes a multi-adapter-enhanced MLLM inference scheduling framework, establishes a regret guarantee for it under multi-dimensional knapsack constraints, and shows that it consistently outperforms state-of-the-art baselines across budget regimes.

Abstract

Multi-modal large language model (MLLM) inference scheduling enables strong response quality under practical and heterogeneous budgets, beyond what a homogeneous single-backend setting can offer. Yet online MLLM task scheduling is nontrivial: requests vary sharply in modality composition and latent reasoning difficulty, while execution backends incur distinct, time-varying costs due to system jitter and network variation. These coupled uncertainties pose two core challenges: deriving semantically faithful yet scheduling-relevant multi-modal task representations, and making low-overhead online decisions over irreversible multi-dimensional budgets. Accordingly, we propose \emph{M$^2$-CMAB} (\underline{M}ulti-modal \underline{M}ulti-constraint \underline{C}ontextual \underline{M}ulti-\underline{A}rmed \underline{B}andit), a multi-adapter-enhanced MLLM inference scheduling framework with three components: (i) a CLS-attentive, frozen-backbone \emph{Predictor} that extracts compact task representations and updates only lightweight adapters for action-specific estimation; (ii) a primal-dual \emph{Constrainer} that maintains online Lagrange multipliers to enforce long-horizon constraints via per-round objectives; and (iii) a two-phase \emph{Scheduler} that balances exploration and exploitation under irreversible budgets. We establish a regret guarantee under multi-dimensional knapsack constraints. On a composite multimodal benchmark with heterogeneous backends, \emph{M$^2$-CMAB} consistently outperforms state-of-the-art baselines across budget regimes, achieving up to 14.18% higher reward and closely tracking an oracle-aided upper bound. Code is available at https://anonymous.4open.science/r/M2CMAB/.
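
To make the interplay between the Constrainer and the Scheduler more concrete, the following is a minimal Python sketch of a generic two-phase, primal-dual scheduling loop. It is an illustration under stated assumptions, not the authors' implementation: the function name `schedule`, the parameters `t0` and `step`, the round-robin initial phase, the projected-subgradient dual update, and the stop-on-overspend rule are all hypothetical choices, and the arrays `pred_reward` and `pred_cost` simply stand in for whatever the Predictor would estimate.

```python
import numpy as np

def schedule(pred_reward, pred_cost, budgets, t0=1, step=0.05):
    """Generic two-phase primal-dual scheduler (illustrative sketch only).

    pred_reward: (T, A) estimated reward per round and action
                 (stand-in for a Predictor's output; assumption).
    pred_cost:   (T, A, K) estimated cost per round, action, and resource.
    budgets:     (K,) total budget per resource; spending is irreversible.
    t0:          times each action is tried in the initial phase (assumption).
    step:        dual step size (assumption, not taken from the paper).
    """
    pred_reward = np.asarray(pred_reward, dtype=float)
    pred_cost = np.asarray(pred_cost, dtype=float)
    budgets = np.asarray(budgets, dtype=float)
    T, A, K = pred_cost.shape

    rho = budgets / T            # per-round budget rate for each resource
    lam = np.zeros(K)            # online Lagrange multipliers, one per resource
    remaining = budgets.copy()
    choices = []

    for t in range(T):
        if t < A * t0:
            # Phase 1: try every action t0 times in round-robin order.
            a = t % A
        else:
            # Phase 2: maximize the per-round Lagrangian objective
            # reward - lam . cost over the available actions.
            scores = pred_reward[t] - pred_cost[t] @ lam
            a = int(np.argmax(scores))

        cost = pred_cost[t, a]
        if np.any(cost > remaining):
            break                # chosen action would overspend: stop

        remaining -= cost
        choices.append(a)

        # Dual update: raise a multiplier when spending exceeds the
        # per-round rate, lower it otherwise, and project onto lam >= 0.
        lam = np.maximum(0.0, lam + step * (cost - rho))

    return choices, remaining, lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, A, K = 200, 3, 2          # rounds, actions, resources (toy sizes)
    rewards = rng.uniform(0.0, 1.0, size=(T, A))
    costs = rng.uniform(0.1, 1.0, size=(T, A, K))
    chosen, left, duals = schedule(rewards, costs, budgets=[60.0, 80.0], t0=5)
    print(len(chosen), left, duals)
```

In this sketch the multipliers act as per-resource prices: an action that is expensive in a resource whose budget is being consumed faster than its per-round rate is penalized in later rounds, which is one common way of turning a long-horizon budget constraint into a per-round objective.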

Paper Structure (33 sections, 3 theorems, 30 equations, 5 figures, 4 tables, 2 algorithms)

Key Result

Theorem 4.1

Let $T$ denote the total number of scheduling rounds, $T_0$ the number of times each action is selected in the initial phase, and $A$ the number of actions. Define $\Phi_{\mathrm{min}}=|\bm{\Phi}^{-1}|_\infty$. Then, we have: where $\Phi_{\mathrm{min}} > \max\left\{(A+2)\,T_0,\; T\,\mathcal{M}(T_0)\right\}$ characterizes the feasible choice of $T_0$ and ensures that all budget constraints are sat

Figures (5)

  • Figure 1: Comparison of reward, latency, and monetary cost across MLLM inference backends on a mixed multi-modal task trace.
  • Figure 2: Overview of the M$^2$-CMAB decision-making pipeline.
  • Figure 3: Average inference reward of M$^2$-CMAB and baselines across six datasets and three budget regimes.
  • Figure 4: Sensitivity of M$^2$-CMAB to the initial phase ratio, i.e., $\frac{(A+1)T_0}{T}$, evaluated by average inference reward (y-axis). Bars indicate the percentage of total rounds allocated to the initial phase, with three groups per subfigure corresponding to three budget regimes.
  • Figure 6: Comparison of reward, latency, and monetary cost across MLLM inference backends on individual datasets and a mixed multi-modal task trace.

Theorems & Definitions (9)

  • Theorem 4.1
  • proof
  • Remark 4.2: Discussions and Open Issues
  • proof
  • Definition 4.2: Optimal Static Benchmark Value
  • Definition 4.3: Cumulative Regret of the Online Algorithm
  • Definition 4.4: Squared-Loss Estimation Regret
  • Lemma 4.5: Regret Bound of SquareCBwK (Theorem A.1 in [DBLP:conf/aistats/HanZWXZ23])
  • Lemma 4.6: Accuracy of the Estimated Dual Radius (Lemma 4.1 in [DBLP:conf/aistats/HanZWXZ23])