
Adaptive Vision-Language Model Routing for Computer Use Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Abstract

Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code are available at https://github.com/vllm-project/semantic-router.
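
The threshold-based selection rule described in the abstract (send each action to the cheapest model whose predicted accuracy meets a target reliability threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` class, its `cost` and `predicted_accuracy` fields, the `route` function, and the fallback to the most accurate model when no candidate meets the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one VLM in the routing pool."""
    name: str
    cost: float                # assumed relative cost per call
    predicted_accuracy: float  # assumed per-action estimate in [0, 1]


def route(candidates: Sequence[Candidate], tau: float) -> Candidate:
    """Pick the cheapest candidate whose predicted accuracy is at least tau.

    If no candidate meets tau, fall back to the most accurate one
    (an assumption of this sketch, not a rule stated in the abstract).
    """
    if not candidates:
        raise ValueError("candidate pool is empty")
    sufficient = [c for c in candidates if c.predicted_accuracy >= tau]
    if sufficient:
        return min(sufficient, key=lambda c: c.cost)
    return max(candidates, key=lambda c: c.predicted_accuracy)


# Illustrative numbers only; they are not taken from the paper.
pool = [
    Candidate("small-vlm", cost=1.0, predicted_accuracy=0.80),
    Candidate("large-vlm", cost=10.0, predicted_accuracy=0.95),
]
print(route(pool, tau=0.75).name)  # small model is sufficient
print(route(pool, tau=0.90).name)  # only the large model is sufficient
```

Raising `tau` shifts more actions to the expensive model, which is the cost-accuracy trade-off the abstract refers to.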

Paper Structure (38 sections, 16 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: The CUA action loop. Each iteration requires a full VLM inference call carrying screenshots (2-5K tokens each) plus context. A 20-step task accumulates ~400K input tokens.
  • Figure 2: AVR architecture. The semantic router intercepts CUA tool calls, classifies difficulty via the multimodal embedder, probes the small VLM for confidence, injects memories for warm agents, and routes to the cheapest sufficient model. The contrastive KB (from the Visual Confused Deputy guardrail) can force escalation for high-risk actions.
  • Figure 3: AVR routing decision flowchart. Each tool call passes through safety check, difficulty classification, and confidence probing. Easy actions skip the probe entirely; risky actions go directly to the large model with guardrail verification. Memory injection (dashed) augments the probe for warm agents.
  • Figure 4: Three-tier routing policy (illustrative traffic shares from warm-agent projection). Most actions stay on the cheap small VLM. Hard or uncertain actions escalate to the large VLM. Safety-flagged actions go to the large VLM with additional guardrail verification.
  • Figure 5: Model size vs. grounding accuracy on ScreenSpot-Pro (log scale). Generalist models (squares) cluster near zero regardless of size. Within the Qwen2.5-VL family (triangles, dashed line), accuracy grows sublinearly. GUI specialists (circles) at 7B outperform generalists 100× their size.
  • ...and 4 more figures
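
The decision order described in the Figure 3 and Figure 4 captions (safety check first, then difficulty classification, then confidence probing, with easy actions skipping the probe) can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the `Tier` enum, the `decide_tier` function, the boolean `is_risky` and `is_easy` inputs, and the `min_confidence` default of 0.8 are hypothetical. Memory injection for warm agents is omitted.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    """Hypothetical labels for the three routing tiers."""
    SMALL = "small"
    LARGE = "large"
    LARGE_GUARDED = "large+guardrail"


def decide_tier(
    is_risky: bool,
    is_easy: bool,
    probe: Callable[[], float],
    min_confidence: float = 0.8,  # assumed value, not from the paper
) -> Tier:
    """Route one tool call through safety, difficulty, and probe checks."""
    # Safety-flagged actions go straight to the large model with guardrail
    # verification.
    if is_risky:
        return Tier.LARGE_GUARDED
    # Easy actions stay on the small model and skip the probe entirely.
    if is_easy:
        return Tier.SMALL
    # Otherwise probe the small model and escalate if it is not confident.
    confidence = probe()
    return Tier.SMALL if confidence >= min_confidence else Tier.LARGE


print(decide_tier(is_risky=False, is_easy=False, probe=lambda: 0.55))
```

Passing the probe as a callable keeps the probe call lazy, so the sketch only pays for it on actions that are neither safety-flagged nor classified as easy.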

Theorems & Definitions (1)

  • Definition 1: Memory Equalization