Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning
Yassir Benhammou, Alessandro Tiberio, Gabriel Trautmann, Suman Kalyan
TL;DR
This study critically evaluates MILS, a zero-shot image captioning framework built on an iterative LLM-CLIP pipeline, against the single-pass models BLIP-2 and GPT-4V. It finds that MILS's quality gains are limited relative to its computational overhead: MILS requires orders of magnitude more runtime and cost than its peers. Single-pass methods deliver competitive or superior caption quality at a fraction of the cost, which challenges the claimed training-free advantages of MILS. The findings advocate for efficiency-aware design in multimodal systems and suggest exploring hybrids that combine iterative refinement with streamlined inference for practical deployment.
Abstract
MILS (Multimodal Iterative LLM Solver) is a recently published framework that claims "LLMs can see and hear without any training" by using an iterative, LLM-CLIP-based approach to zero-shot image captioning. Although MILS performs well, our investigation shows that this performance carries a substantial hidden computational cost, driven by its multi-step refinement process. In contrast, alternative models such as BLIP-2 and GPT-4V achieve competitive results in a single pass. We hypothesize that the overhead inherent in MILS's iterative process may undermine its practical benefits, challenging the narrative that zero-shot performance can be attained without heavy resource demands. This work is the first to expose and quantify the trade-off between output quality and computational cost in MILS, providing insights for the design of more efficient multimodal models.
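To make the source of the overhead concrete, the following is a minimal sketch of how model invocations might be counted for an iterative generate-and-score loop versus a single-pass captioner. It is not the MILS implementation. It assumes, for illustration only, that each refinement step makes one LLM call to propose candidate captions and one CLIP scoring evaluation per candidate. The names `CallCount`, `iterative_calls`, and `single_pass_calls`, and the parameters `steps` and `candidates_per_step`, are hypothetical, and the values in the usage lines are placeholders rather than measured settings.

```python
from dataclasses import dataclass


@dataclass
class CallCount:
    """Number of model invocations needed to caption one image."""
    generator_calls: int  # LLM / captioner forward passes
    scorer_calls: int     # image-text scoring evaluations (e.g. CLIP)

    @property
    def total(self) -> int:
        return self.generator_calls + self.scorer_calls


def iterative_calls(steps: int, candidates_per_step: int) -> CallCount:
    # Assumption: one generator call per refinement step, and every
    # candidate produced in that step is scored once against the image.
    return CallCount(
        generator_calls=steps,
        scorer_calls=steps * candidates_per_step,
    )


def single_pass_calls() -> CallCount:
    # A single-pass captioner produces its caption in one forward pass
    # and needs no separate scoring loop.
    return CallCount(generator_calls=1, scorer_calls=0)


if __name__ == "__main__":
    # Placeholder settings, chosen only to illustrate the scaling.
    iterative = iterative_calls(steps=10, candidates_per_step=50)
    single = single_pass_calls()
    print("iterative total calls:", iterative.total)
    print("single-pass total calls:", single.total)
```

Under these assumptions the call count grows with the product of steps and candidates per step, while the single-pass count stays constant. Call counts are only a proxy: actual runtime and cost also depend on model size, batching, and hardware.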
