LLMs can see and hear without any training
Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
TL;DR
MILS presents a training-free, test-time framework that endows LLMs with multimodal perception and generation by wiring a Generator (LLM) to propose candidates and a Scorer (multimodal model) to evaluate them, iteratively refining outputs over $N$ steps. This gradient-free, modular loop yields emergent zero-shot capabilities across images, video, audio, and generation/editing tasks, including cross-modal arithmetic via embedding inversion to text. Across image, video, and audio captioning, as well as high-quality image generation and style transfer, MILS matches or surpasses task-specific baselines without any training data curation. The approach highlights a practical, scalable path to multimodal AI, with potential extensions to additional modalities and tasks as LLMs and multimodal encoders improve.
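The Generator/Scorer loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `mils_loop`, `toy_generate`, and `toy_score` are hypothetical names, the word-mutating generator stands in for an LLM, and the word-overlap scorer stands in for a multimodal model. It is not the authors' implementation; it only shows the gradient-free propose, score, and feed-back pattern over $N$ steps.

```python
import random
from typing import Callable, List, Tuple

Scored = List[Tuple[float, str]]  # (score, candidate text) pairs

def mils_loop(generate: Callable[[Scored], List[str]],
              score: Callable[[str], float],
              n_steps: int = 10, keep_top: int = 3):
    """Propose candidates, score them, and feed the best back for n_steps rounds."""
    feedback: Scored = []
    history: List[float] = []
    for _ in range(n_steps):
        candidates = generate(feedback)               # Generator proposes
        scored = [(score(c), c) for c in candidates]  # Scorer evaluates
        # Merge with earlier survivors so the best score never decreases.
        pool = {c: s for s, c in feedback + scored}
        feedback = sorted(((s, c) for c, s in pool.items()), reverse=True)[:keep_top]
        history.append(feedback[0][0])
    best_score, best_text = feedback[0]
    return best_score, best_text, history

# --- Toy stand-ins (hypothetical, for illustration only) ---
VOCAB = ["a", "dog", "cat", "runs", "sleeps", "on", "grass", "sofa"]
TARGET = {"dog", "runs", "grass"}  # hidden "image content" known only to the scorer

def toy_score(text: str) -> float:
    """Reward words that match the hidden content, lightly penalize the rest."""
    words = set(text.split())
    return len(words & TARGET) - 0.1 * len(words - TARGET)

def toy_generate(feedback: Scored, rng: random.Random) -> List[str]:
    """Extend each surviving candidate with one random word (an LLM would do better)."""
    seeds = [c for _, c in feedback] or [""]
    return [(seed + " " + rng.choice(VOCAB)).strip() for seed in seeds for _ in range(4)]

rng = random.Random(0)
best_score, best_text, history = mils_loop(
    lambda fb: toy_generate(fb, rng), toy_score, n_steps=8)
print(best_text, best_score, history)
```

No gradients flow between the two components: the Generator only ever sees scored text, which is why, in this sketch, either stand-in can be swapped for a different model without touching the loop.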
Abstract
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging the LLM's innate ability to perform multi-step reasoning, MILS prompts it to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video, and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
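To make the idea of inverting embeddings into text concrete, here is a small hedged sketch of cross-modal arithmetic. It assumes a shared embedding space and combines two embeddings by simple addition; both choices are assumptions, as is every name below (`embed`, `combine`, `invert_to_text`). The bag-of-words `embed` is a toy stand-in for a real multimodal encoder, and the fixed candidate list stands in for text proposed by an LLM.

```python
import math
from typing import List

VOCAB = ["dog", "beach", "waves", "barking", "city", "piano"]

def embed(words: List[str]) -> List[float]:
    """Toy shared-space embedding: L2-normalized bag-of-words counts (hypothetical)."""
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def combine(a: List[float], b: List[float]) -> List[float]:
    """Assumed arithmetic: element-wise sum of two embeddings."""
    return [x + y for x, y in zip(a, b)]

def invert_to_text(target: List[float], candidates: List[str]) -> str:
    """Pick the candidate text whose embedding is closest to the target."""
    return max(candidates, key=lambda text: cosine(embed(text.split()), target))

image_emb = embed(["dog", "beach"])      # pretend this came from an image encoder
audio_emb = embed(["waves", "barking"])  # pretend this came from an audio encoder
target = combine(image_emb, audio_emb)

candidates = ["dog beach", "piano city", "dog beach waves barking", "waves"]
best = invert_to_text(target, candidates)
print(best)
```

In a full system the candidate list would be regenerated each round from scored feedback rather than fixed up front; only the similarity-based scoring step is shown here.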
