Olympus: A Universal Task Router for Computer Vision Tasks

Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip H. S. Torr

TL;DR

Olympus addresses the challenge of unifying diverse vision tasks across images, video, and 3D content without training massive generative models. It uses a Multimodal LLM as a controller to route tasks to external specialist modules via explicit routing tokens, enabling single-instruction chains of actions. The authors introduce OlympusInstruct and OlympusBench, with 446.3K training and 49.6K evaluation samples across 20 tasks, and demonstrate routing accuracy of 94.75% and chain-of-action precision of 91.82%, competitive with leading MLLMs on standard benchmarks. The modular routing approach offers scalability and flexibility for incorporating new tasks and models while keeping training costs manageable.

Abstract

We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: http://yuanze-lin.me/Olympus_page/
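The routing idea in the abstract can be illustrated with a minimal sketch. It assumes the controller's response wraps each refined prompt in a task-specific tag pair such as `<image_gen>...</image_gen>`; that tag syntax, the task names, and the stub specialists below are hypothetical placeholders chosen for illustration, not the authors' implementation. A response without routing tags is treated as a direct answer from the MLLM, and several tags in one response are executed in order as a chain of actions.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tag format: <task_name>refined prompt</task_name>.
# The backreference (?P=task) requires the closing tag to match the opening one.
ROUTE_PATTERN = re.compile(
    r"<(?P<task>[a-z_0-9]+)>(?P<prompt>.*?)</(?P=task)>", re.DOTALL
)

Specialist = Callable[[str, Optional[str]], str]


def parse_routes(response: str) -> List[Tuple[str, str]]:
    """Extract (task, refined_prompt) pairs in the order they appear."""
    return [
        (m.group("task"), m.group("prompt").strip())
        for m in ROUTE_PATTERN.finditer(response)
    ]


def run_chain(
    response: str,
    specialists: Dict[str, Specialist],
    initial_input: Optional[str] = None,
) -> str:
    """Dispatch each routed step to its specialist, feeding each output forward."""
    routes = parse_routes(response)
    if not routes:
        # No routing tags: the controller answered directly (e.g. a VQA reply).
        return response
    result = initial_input
    for task, prompt in routes:
        if task not in specialists:
            raise KeyError(f"no specialist registered for task '{task}'")
        result = specialists[task](prompt, result)
    return result


# Stub specialists standing in for real external models (assumptions).
STUB_SPECIALISTS: Dict[str, Specialist] = {
    "image_gen": lambda prompt, prev: f"image({prompt})",
    "image_edit": lambda prompt, prev: f"edit({prev}, {prompt})",
}

if __name__ == "__main__":
    chained = "<image_gen>a red fox</image_gen><image_edit>add snow</image_edit>"
    print(run_chain(chained, STUB_SPECIALISTS))
```

Because specialists are looked up in a plain dictionary, supporting a new task in this sketch means registering one more entry, which mirrors the modularity the abstract describes.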

Paper Structure

This paper contains 29 sections, 1 equation, 11 figures, 17 tables.

Figures (11)

  • Figure 1: With its versatile capabilities, Olympus addresses a broad spectrum of vision tasks across images, videos, and even 3D content, covering over 20 different tasks.
  • Figure 2: Given user prompts, a trainable MLLM can perform routing across a wide range of specified models. In this concept, the MLLM solves multimodal understanding tasks (e.g., VQA) with its inherent capabilities, allocates appropriate specialized models to multimodal generative and classic vision tasks (e.g., image generation and depth estimation), and then aggregates the results and delivers a response to the user.
  • Figure 3: The framework of Olympus. It solves tasks such as VQA directly through the inherent capabilities of the MLLM. For other tasks, e.g., image editing, Olympus generates a response consisting of task-specific routing tokens and refined prompts, which are then used to schedule specialist models that address diverse user requests.
  • Figure 4: An example of the prompt used to generate instruction-response pairs for image editing with GPT-4o.
  • Figure 5: Statistics of the collected dataset. Note that in figure (b), CVG and CIG denote controllable video generation and controllable image generation, RVOS denotes referring video object segmentation, and image SR denotes image super-resolution.
  • ...and 6 more figures