MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal

TL;DR

The paper addresses fragmentation in vision-language task solutions by introducing MMFactory, a universal solution search engine. It relies on two routing components: a Solution Router, which assembles a pool of general, executable solutions from a model repository, and a Metric Router, which benchmarks those solutions under user-defined constraints. A multi-agent solution proposer, organized into proposer and committee teams, generates diverse, robust solutions and validates them through iterative feedback and checks. Experiments on BLINK and SeedBench show state-of-the-art performance and lower API costs due to reusable solution pools, making task-specific VLM deployment more practical for non-experts.

Abstract

With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.

Paper Structure

This paper contains 14 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Illustration of MMFactory. Proposed MMFactory framework (a) contrasted with model routing approaches (c) and multimodal LLM with tools (b). Unlike both prior classes of methods, MMFactory proposes a pool of programmatic solutions, each composed of a series of selected models from the pool, for a given task while also benchmarking their performance and computational characteristics. See the introduction for full discussion.
  • Figure 2: Overview of MMFactory. Our framework includes two primary components: Solution Router and Metric Router. The Solution Router generates a pool of potential solutions for the task, while the Metric Router evaluates these solutions, estimating their performance and computational cost to generate a performance curve. This curve enables users to select the model optimal for their task requirements.
  • Figure 3: Illustration of user specification inputs $\mathcal{P}^u$.
  • Figure 4: Illustration of multi-agent conversation. In the Solution Router, two teams of agents converse to produce the final outputs.
  • Figure 5: Qualitative examples of MMFactory. MMFactory showcases its abilities to use and combine models by automatically constructing better prompts for MLLMs (in Sol 0) and developing solutions with similar logic but utilizing stronger models (in Sol 4).
  • ...and 2 more figures
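
The selection step described in the Figure 2 caption (benchmark a pool of solutions, build a performance curve, let the user choose under constraints) can be illustrated with a minimal Python sketch. This is not MMFactory's implementation: the `BenchmarkedSolution` type, the `pareto_front` and `pick_solution` helpers, and every accuracy and cost value below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkedSolution:
    """Hypothetical record for one benchmarked solution in the pool."""
    name: str
    accuracy: float  # higher is better
    cost: float      # e.g. seconds or API cost per sample; lower is better


def pareto_front(pool: List[BenchmarkedSolution]) -> List[BenchmarkedSolution]:
    """Return the solutions that are not dominated in (cost, accuracy)."""
    ordered = sorted(pool, key=lambda s: (s.cost, -s.accuracy))
    front: List[BenchmarkedSolution] = []
    best_accuracy = float("-inf")
    for sol in ordered:
        # Keep a solution only if it beats every cheaper one on accuracy.
        if sol.accuracy > best_accuracy:
            front.append(sol)
            best_accuracy = sol.accuracy
    return front


def pick_solution(
    pool: List[BenchmarkedSolution],
    max_cost: Optional[float] = None,
    min_accuracy: Optional[float] = None,
) -> Optional[BenchmarkedSolution]:
    """Pick the most accurate solution that satisfies the user constraints."""
    feasible = [
        s for s in pool
        if (max_cost is None or s.cost <= max_cost)
        and (min_accuracy is None or s.accuracy >= min_accuracy)
    ]
    if not feasible:
        return None
    # Break accuracy ties in favor of the cheaper solution.
    return max(feasible, key=lambda s: (s.accuracy, -s.cost))


# Made-up benchmark numbers, for illustration only.
pool = [
    BenchmarkedSolution("sol_0", accuracy=0.62, cost=1.0),
    BenchmarkedSolution("sol_1", accuracy=0.70, cost=3.0),
    BenchmarkedSolution("sol_2", accuracy=0.68, cost=4.0),
    BenchmarkedSolution("sol_3", accuracy=0.81, cost=9.0),
]

curve = pareto_front(pool)
choice = pick_solution(pool, max_cost=5.0)
```

With these made-up numbers, `sol_2` drops off the curve because `sol_1` is both cheaper and more accurate, and a cost budget of 5.0 selects `sol_1`. Returning `None` when no solution is feasible leaves it to the caller to relax a constraint rather than silently violating it.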