Mordal: Automated Pretrained Model Selection for Vision Language Models

Shiqi He, Insu Jang, Mosharaf Chowdhury

TL;DR

Mordal tackles the challenge of selecting pretrained vision encoders and language models for vision-language models by reframing model selection as a resource-constrained search. It introduces a two-stage clustering approach based on CKA-derived representation similarity to prune candidates, and combines inter-cluster pruning with intra-cluster evaluation guided by an observational scaling law to predict full-data performance with reduced data. Through extensive experiments across multiple datasets and model zoos, Mordal achieves substantial GPU-hour savings (8.9×–11.6×) while maintaining near-optimal performance and robust top-k rankings, outperforming naive grid search in most tasks. This yields a practical, scalable pathway to deploy task-tuned VLMs and uncover new, strong VLM candidates without exhaustive training. The framework’s efficiency and effectiveness have direct implications for real-world multimodal applications in healthcare, robotics, and accessibility, enabling rapid, data-driven model selection as new pretrained components emerge.
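The representation-similarity signal mentioned above can be illustrated with a short sketch. It assumes the linear variant of CKA and plain NumPy feature matrices; the function name `linear_cka` and the random inputs are illustrative only and are not taken from the Mordal implementation.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices.

    x has shape (n_samples, d1) and y has shape (n_samples, d2); row i of
    both matrices must come from the same input example.
    """
    # Center each feature dimension so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Illustrative usage with random stand-ins for two encoders' features.
rng = np.random.default_rng(0)
features_a = rng.normal(size=(50, 8))
features_b = rng.normal(size=(50, 16))
similarity = linear_cka(features_a, features_b)
```

A score near 1 means the two models represent the same inputs in nearly the same way (up to rotation and uniform scaling), so they are plausible members of one cluster; how Mordal thresholds or groups these scores is not specified in this summary.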

Abstract

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest-growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$--$11.6\times$ fewer GPU hours than grid search. In the process of our evaluation, we have also discovered new VLMs that outperform their state-of-the-art counterparts.

Paper Structure

This paper contains 42 sections, 4 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Benchmark performance of five latest open-source VLMs on six multimodal tasks.
  • Figure 2: An overview figure for Mordal. Gray circles and blocks represent pretrained models and VLM candidates, respectively. White blocks represent inactive or eliminated candidates. Mordal first groups similar candidates into clusters, where each candidate consists of one VE and one LLM. During efficient evaluation, every cluster picks one candidate (i.e., inter-cluster) and Mordal evaluates them. Poorly performing clusters are eliminated. For each candidate in the remaining clusters (i.e., intra-cluster), Mordal fits a linear regression model and predicts its performance to select the best candidate.
  • Figure 3: An example showing the language model clustering process with four VEs and two LLMs. Different VE clusters will lead to different LLM clusters.
  • Figure 4: Inter- and intra-cluster evaluation with candidate clusters.
  • Figure 5: Applying early stopping mechanism and scaling prediction to inter- and intra-cluster evaluation.
  • ...and 8 more figures
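
The intra-cluster step described in the Figure 2 caption (fit a linear regression per candidate, then pick the candidate with the best predicted performance) can be sketched as follows. This is a minimal illustration, not the authors' code: the functional form (score linear in log10 of the training-data fraction), the function names, the candidate names, and all numbers are assumptions made up for the example.

```python
import numpy as np


def predict_full_data_score(fractions, scores):
    """Fit score = slope * log10(fraction) + intercept on reduced-data runs
    and extrapolate to the full dataset (fraction = 1.0)."""
    slope, intercept = np.polyfit(np.log10(fractions), scores, deg=1)
    return float(slope * np.log10(1.0) + intercept)


def select_best(candidates):
    """candidates maps a name to (fractions, scores) from partial training.

    Returns the name with the highest predicted full-data score, plus all
    predictions.
    """
    predictions = {
        name: predict_full_data_score(fractions, scores)
        for name, (fractions, scores) in candidates.items()
    }
    best = max(predictions, key=predictions.get)
    return best, predictions


# Hypothetical candidates (one VE paired with one LLM) with made-up scores
# measured at 10%, 20%, and 40% of the training data.
remaining = {
    "ve_a+llm_a": ([0.1, 0.2, 0.4], [0.40, 0.45, 0.50]),
    "ve_a+llm_b": ([0.1, 0.2, 0.4], [0.42, 0.44, 0.46]),
}
best_name, predicted = select_best(remaining)
```

In this toy input the second candidate scores higher at 10% of the data, but the first improves faster and has the higher predicted full-data score, which is the kind of ranking change that extrapolation is meant to catch.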