LLMs can see and hear without any training

Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

TL;DR

MILS presents a training-free, test-time framework that endows LLMs with multimodal perception and generation by wiring a Generator (LLM) to propose candidates and a Scorer (multimodal model) to evaluate them, iteratively refining outputs over $N$ steps. This gradient-free, modular loop yields emergent zero-shot capabilities across images, video, audio, and generation/editing tasks, including cross-modal arithmetic via embedding inversion to text. Across image, video, and audio captioning, as well as high-quality image generation and style transfer, MILS matches or surpasses task-specific baselines without any training data curation. The approach highlights a practical, scalable path to multimodal AI, with potential extensions to additional modalities and tasks as LLMs and multimodal encoders improve.

Abstract

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging the LLM's innate ability to perform multi-step reasoning, MILS prompts it to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
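The generate, score, and feed-back loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `mils_loop`, `make_toy_generator`, `make_toy_scorer`, the `top_k` parameter, and the tiny vocabulary are all hypothetical. The toy generator mutates words at random where MILS would prompt an LLM, and the toy scorer measures word overlap with a hidden target phrase where MILS would use a multimodal model to score text against the test sample.

```python
import random
from typing import Callable, List, Tuple

Scored = Tuple[float, str]  # (score, candidate text)

# Hypothetical toy vocabulary; a real Generator would be an LLM.
VOCAB = ["a", "dog", "cat", "runs", "sleeps", "on", "grass", "sofa"]


def mils_loop(
    generate: Callable[[List[Scored]], List[str]],
    score: Callable[[str], float],
    n_steps: int,
    top_k: int = 4,
) -> Scored:
    """Gradient-free search: propose candidates, score them, feed the best back."""
    feedback: List[Scored] = []
    for _ in range(n_steps):
        candidates = generate(feedback)
        # Merge new candidates with the previous best so the top score never drops.
        pool = {text: s for s, text in feedback}
        pool.update({text: score(text) for text in candidates})
        ranked = sorted(pool.items(), key=lambda item: (-item[1], item[0]))
        feedback = [(s, text) for text, s in ranked[:top_k]]
    return feedback[0]


def make_toy_generator(rng: random.Random, n_candidates: int = 8, length: int = 4):
    """Stand-in for the LLM: random phrases first, then one-word mutations of the best."""
    def generate(feedback: List[Scored]) -> List[str]:
        if not feedback:
            return [
                " ".join(rng.choice(VOCAB) for _ in range(length))
                for _ in range(n_candidates)
            ]
        proposals = []
        for _ in range(n_candidates):
            _, parent = rng.choice(feedback)
            words = parent.split()
            words[rng.randrange(len(words))] = rng.choice(VOCAB)
            proposals.append(" ".join(words))
        return proposals
    return generate


def make_toy_scorer(target: str):
    """Stand-in for the multimodal Scorer: Jaccard word overlap with a hidden target."""
    target_words = set(target.split())
    def score(text: str) -> float:
        words = set(text.split())
        return len(words & target_words) / len(words | target_words)
    return score


if __name__ == "__main__":
    scorer = make_toy_scorer("a dog sleeps on grass")
    best_score, best_text = mils_loop(
        make_toy_generator(random.Random(0)), scorer, n_steps=20
    )
    print(f"{best_score:.2f}  {best_text}")
```

No gradients flow through either component in this sketch: the only signal passed back to the generator is the ranked list of scored candidates, which is why the two modules can be swapped independently.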

Paper Structure

This paper contains 22 sections, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Our proposed approach, MILS, enables various applications, from captioning images, video, or audio; improving text-to-image generation; image editing such as style transfer; as well as arithmetic across different modalities by inverting them all into text. It accomplishes all this using a purely test-time optimization approach without any task specific training or data curation!
  • Figure 2: MILS leverages two key modules, Generator and Scorer, to solve multimodal tasks. The Generator will generate a number of text candidates, e.g. captions for image captioning and prompts for T2I, each of which will be scored by the Scorer, and passed back into the Generator as feedback to generate the next batch of text candidates, eventually producing the final output for the input test sample.
  • Figure 3: Image captioning using MILS, compared to the existing state-of-the-art zero-shot approach, MeaCap (Zeng et al., 2024). MILS, while being a much simpler approach, produces more accurate and syntactically correct captions for the image.
  • Figure 4: Improved text-to-image (T2I) generation using MILS. We apply MILS to two of the latest state-of-the-art T2I models, a latent diffusion model (LDM) and FLUX.1 [schnell] (FLUX). Human annotators compare MILS's outputs to the generations from the initial models. Evaluated over the 200-prompt DrawBench dataset, the annotators clearly preferred MILS's generations on both overall quality and text faithfulness, across both models.
  • Figure 5: Improving image generation using MILS. Applying MILS to a Generator using the same base model, a Latent Diffusion Model (LDM) in this case, leads to much higher quality images. We show the original input prompt, the generation from the base model, and from MILS.
  • ...and 11 more figures