Mordal: Automated Pretrained Model Selection for Vision Language Models
Shiqi He, Insu Jang, Mosharaf Chowdhury
TL;DR
Mordal reframes the selection of pretrained vision encoders and language models for vision-language models (VLMs) as a resource-constrained search. It clusters candidates in two stages by CKA-derived representation similarity, then combines inter-cluster pruning with intra-cluster evaluation, using an observational scaling law to predict full-data performance from reduced data. In experiments across multiple datasets and model zoos, Mordal uses 8.9×–11.6× fewer GPU hours while maintaining near-optimal performance and robust top-k rankings, outperforming naive grid search in most tasks. This offers a practical, scalable way to deploy task-tuned VLMs and to uncover strong new VLM candidates without exhaustive training. The framework’s efficiency is directly relevant to multimodal applications in healthcare, robotics, and accessibility, where it enables rapid, data-driven model selection as new pretrained components emerge.
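To make the similarity measure concrete, the following is a minimal sketch of linear centered kernel alignment (CKA) between two feature matrices. It is not Mordal's implementation: the summary says only that the similarity is CKA-derived, so the linear variant, the function names (`center`, `cross_gram_sq_norm`, `linear_cka`), and the synthetic features are all assumptions made for illustration. It uses only the standard library; a practical version would likely use an array library.

```python
import math
import random


def center(features):
    """Subtract the per-column mean from an n x d list-of-lists matrix."""
    n = len(features)
    means = [sum(col) / n for col in zip(*features)]
    return [[v - m for v, m in zip(row, means)] for row in features]


def cross_gram_sq_norm(x, y):
    """Squared Frobenius norm of y^T x for two matrices with the same row count."""
    total = 0.0
    for col_y in zip(*y):
        for col_x in zip(*x):
            dot = sum(a * b for a, b in zip(col_y, col_x))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    x, y = center(x), center(y)
    numerator = cross_gram_sq_norm(x, y)
    denominator = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return numerator / denominator


# Synthetic stand-ins for features that two encoders produce on the same 100 inputs.
random.seed(0)
feats_a = [[random.gauss(0, 1) for _ in range(4)] for _ in range(100)]
# Same representation up to a column permutation and isotropic scaling.
feats_b = [[2.0 * v for v in reversed(row)] for row in feats_a]
# Independent random features with a different width.
feats_c = [[random.gauss(0, 1) for _ in range(6)] for _ in range(100)]

print("a vs. b:", linear_cka(feats_a, feats_b))
print("a vs. c:", linear_cka(feats_a, feats_c))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so `feats_a` and `feats_b` score as equivalent representations, while the independent `feats_c` scores much lower. Pairwise scores of this kind could serve as the similarity input to a clustering step.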
Abstract
Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest-growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities on different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$--$11.6\times$ fewer GPU hours than grid search. In the course of our evaluation, we also discovered new VLMs that outperform their state-of-the-art counterparts.
