Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

N Dinesh Reddy; Dylan Snyder; Lona Kiragu; Mirajul Mohin; Shahrear Bin Amin; Sudeep Pillai

TL;DR

Orion tackles the limitations of monolithic vision-language models by introducing an agent-based, tool-augmented visual system that plans, executes, and verifies multi-step workflows across images, video, and documents. It fuses large-language-model-style reasoning with a rich library of domain-specific vision tools to achieve precise, production-grade visual intelligence. Across 46 diverse tasks and benchmarks, Orion demonstrates competitive performance with reduced hallucinations and robust, verifiable outputs, enabling complex workflows such as structured data extraction, cross-modal reasoning, and high-fidelity generation. By bridging neural perception with symbolic execution and providing a schema-driven, API-ready interface, Orion advances practical visual AI for real-world applications and enterprise deployment.

Abstract

We introduce Orion, a visual agent that integrates vision-based reasoning with tool-augmented execution to achieve powerful, precise, multi-step visual intelligence across images, video, and documents. Unlike traditional vision-language models that generate descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance across MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic VLM capabilities to production-grade visual intelligence. Through its agentic, tool-augmented approach, Orion enables autonomous visual reasoning that bridges neural perception with symbolic execution, marking the transition from passive visual understanding to active, tool-driven visual intelligence.

Try Orion for free at: https://chat.vlm.run
Learn more at: https://www.vlm.run/orion
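
The plan, execute, and verify pattern described above can be illustrated with a small sketch. This is not Orion's implementation or API: the tool names (`detect_objects`, `run_ocr`), the `Step` type, the `verify` rule, and the file name `scan.png` are all hypothetical stand-ins, and the tools return canned values instead of running real models. It only shows how a planner's output might be dispatched to specialized tools, with each result checked before the next step runs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One planned tool call (hypothetical structure, not Orion's schema)."""
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)


# Hypothetical stand-ins for vision tools; real tools would run models.
def detect_objects(image: str) -> List[Dict[str, Any]]:
    return [{"label": "invoice", "box": [0, 0, 100, 200]}]


def run_ocr(image: str, box: List[int]) -> str:
    return "TOTAL 42.00"


# Registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., Any]] = {
    "detect_objects": detect_objects,
    "run_ocr": run_ocr,
}


def verify(step: Step, result: Any) -> bool:
    """Toy check: reject empty results so a failed step is not passed on."""
    return result not in (None, "", [])


def execute(plan: List[Step]) -> List[Any]:
    """Run each planned step in order, verifying every result."""
    results: List[Any] = []
    for step in plan:
        if step.tool not in TOOLS:
            raise KeyError(f"unknown tool: {step.tool}")
        result = TOOLS[step.tool](**step.args)
        if not verify(step, result):
            raise ValueError(f"verification failed for step: {step.tool}")
        results.append(result)
    return results


# A fixed two-step plan; in an agent, a reasoning model would produce this.
plan = [
    Step("detect_objects", {"image": "scan.png"}),
    Step("run_ocr", {"image": "scan.png", "box": [0, 0, 100, 200]}),
]
outputs = execute(plan)
```

Keeping tools behind a name-to-callable registry lets the planner emit plain data (tool name plus arguments) instead of code, so each step can be logged and checked on its own.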
