One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

Mayank Saini, Arit Kumar Bishwas

Abstract

We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes the results using adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use modality decomposition assisted by a small language model (SLM). Evaluated on 2,847 queries across 15 task categories, our framework achieves a 72% reduction in time-to-accurate-answer, an 85% reduction in conversational rework, and a 67% reduction in cost compared to a matched hierarchical baseline, while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves the economics of multimodal AI deployment.
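The routing pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Subtask` type, the stub tool functions, the `TOOLS` table, and the word-count heuristic in `route_text` are all hypothetical stand-ins. In particular, the heuristic only marks the place where a learned router such as RouteLLM would choose a model, and only the three tools named as examples in the abstract are stubbed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    """One unit of work produced by decomposing a user query (hypothetical type)."""
    modality: str  # e.g. "text", "image", "audio", "document"
    payload: str   # a file name or text fragment


# Hypothetical stub tools standing in for real detectors, OCR, and ASR models.
def detect_objects(payload: str) -> str:
    return f"detect_objects({payload})"


def run_ocr(payload: str) -> str:
    return f"ocr({payload})"


def transcribe(payload: str) -> str:
    return f"transcribe({payload})"


# Assumed mapping from modality to tool; the paper's actual tool pool may differ.
TOOLS: Dict[str, Callable[[str], str]] = {
    "image": detect_objects,
    "document": run_ocr,
    "audio": transcribe,
}


def route_text(payload: str) -> str:
    """Placeholder for learned routing: a word-count heuristic, NOT RouteLLM."""
    model = "strong-model" if len(payload.split()) > 8 else "weak-model"
    return f"{model}({payload})"


def supervise(subtasks: List[Subtask]) -> List[str]:
    """Delegate each subtask to a modality-appropriate handler and collect results."""
    results = []
    for task in subtasks:
        if task.modality == "text":
            results.append(route_text(task.payload))
        elif task.modality in TOOLS:
            results.append(TOOLS[task.modality](task.payload))
        else:
            # No tool registered for this modality in the sketch.
            results.append(f"unsupported({task.modality})")
    return results


if __name__ == "__main__":
    query = [Subtask("image", "cat.png"), Subtask("text", "what animal is this")]
    print(supervise(query))
```

Keeping the tool table as data rather than as a chain of hard-coded branches loosely mirrors the abstract's contrast between adaptive delegation and predetermined decision trees: registering a new modality only requires a new table entry.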

Paper Structure (8 sections, 10 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Architectural paradigm comparison showing the transition from scripted routing with brittle global restarts to Supervisor-driven adaptive orchestration with tool and model pools.
  • Figure 2: Centralized orchestration architecture coordinating specialized tools across modalities through dynamic task decomposition and delegation.
  • Figure 3: Modality-specific memory architecture with hierarchical layers and unified context scoring managed by the central orchestrator.
  • Figure 4: Performance improvements across modalities, showing consistent 65–77% reductions in time-to-accurate-answer (TTA) and 82–89% reductions in rework.
  • Figure 5: Cost analysis by tool category, showing 62–85% cost reduction, with Memory Tools achieving the highest efficiency.