Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning

Yassir Benhammou, Alessandro Tiberio, Gabriel Trautmann, Suman Kalyan

TL;DR

This study critically evaluates MILS, a zero-shot image captioning framework built on an iterative LLM-CLIP pipeline, against the single-pass models BLIP-2 and GPT-4V. It shows that MILS's quality gains are limited relative to its computational overhead: MILS requires orders of magnitude more runtime and cost than its peers. Single-pass methods deliver competitive or superior caption quality at a fraction of the cost, challenging the claimed training-free advantages of MILS. The findings advocate for efficiency-aware design in multimodal systems and suggest exploring hybrids that combine refinement with streamlined inference for practical deployment.

Abstract

MILS (Multimodal Iterative LLM Solver) is a recently published framework that claims "LLMs can see and hear without any training" by leveraging an iterative, LLM-CLIP-based approach for zero-shot image captioning. While MILS demonstrates good performance, our investigation reveals that this success comes at a hidden, substantial computational cost due to its expensive multi-step refinement process. In contrast, alternative models such as BLIP-2 and GPT-4V achieve competitive results through a streamlined, single-pass approach. We hypothesize that the significant overhead inherent in MILS's iterative process may undermine its practical benefits, thereby challenging the narrative that zero-shot performance can be attained without incurring heavy resource demands. This work is the first to expose and quantify the trade-offs between output quality and computational cost in MILS, providing critical insights for the design of more efficient multimodal models.

Paper Structure

This paper contains 12 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Qualitative Comparison of Captions. Each column corresponds to one image from the COCO dataset, and shows the caption outputs generated by GPT-4V, BLIP-2, and MILS. The figure highlights differences in descriptive detail, linguistic fluency, and contextual grounding among the three approaches.