
Language Models as Black-Box Optimizers for Vision-Language Models

Shihong Liu, Zhiqiu Lin, Samuel Yu, Ryan Lee, Tiffany Ling, Deepak Pathak, Deva Ramanan

TL;DR

The paper presents a truly black-box approach to fine-tune vision-language models by treating chat-based LLMs as prompt optimizers. Using a hill-climbing style loop with exploration and exploitation, and incorporating positive and negative textual feedback, the method yields competitive one-shot CLIP performance across 11 datasets and promotes interpretable, transferable prompts. The framework extends to text-to-image generation with DALL-E 3, achieving improved faithfulness via prompt inversion and personalization, supported by extensive ablations and cross-architecture transferability analyses. Overall, the work demonstrates that language-based prompt optimization can rival white-box methods in extremely low-shot regimes while maintaining a fully black-box workflow.

Abstract

Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we apply our framework to optimize the state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.
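The hill-climbing procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `hill_climb`, `toy_score`, and `make_toy_proposer` are hypothetical names, the scorer is a toy keyword match standing in for few-shot accuracy measured on a black-box VLM, and the proposer is a random word-appender standing in for a chat-based LLM. The sketch also omits the restarts and resets mentioned in the figure captions.

```python
import random
from typing import Callable, Dict, List, Tuple


def hill_climb(initial_prompts: List[str],
               score: Callable[[str], float],
               propose: Callable[[List[str], List[str]], List[str]],
               n_iters: int = 10,
               k: int = 1) -> Tuple[str, float]:
    """Keep a scored pool of prompts; each round, show the proposer the
    top-k (positive) and bottom-k (negative) prompts and score its output."""
    pool: Dict[str, float] = {p: score(p) for p in initial_prompts}
    for _ in range(n_iters):
        ranked = sorted(pool, key=pool.get, reverse=True)
        positives = ranked[:k]
        negatives = ranked[len(ranked) - k:]
        for candidate in propose(positives, negatives):
            if candidate not in pool:  # only score prompts not seen before
                pool[candidate] = score(candidate)
    best = max(pool, key=pool.get)
    return best, pool[best]


# --- Toy stand-ins (assumptions, not part of the paper) ---
TARGET = {"photo", "detailed", "bird"}
VOCAB = ["photo", "detailed", "bird", "blurry", "sketch", "a", "of"]


def toy_score(prompt: str) -> float:
    """Stand-in for black-box accuracy: fraction of target words present."""
    return len(TARGET & set(prompt.split())) / len(TARGET)


def make_toy_proposer(seed: int = 0):
    """Stand-in for the LLM: extend good prompts, avoiding words that
    appear only in the bad prompts."""
    rng = random.Random(seed)

    def propose(positives: List[str], negatives: List[str]) -> List[str]:
        good_words = {w for p in positives for w in p.split()}
        bad_words = {w for p in negatives for w in p.split()} - good_words
        choices = [w for w in VOCAB if w not in bad_words] or VOCAB
        return [f"{p} {rng.choice(choices)}" for p in positives]

    return propose


initial = ["a sketch", "a photo", "blurry"]
best_prompt, best_score = hill_climb(initial, toy_score, make_toy_proposer())
print(best_prompt, best_score)
```

Because the pool never discards a scored prompt, the returned score can never fall below the best initial prompt. In a real setting, `score` would query the VLM on the few-shot set and `propose` would send the positive and negative prompts to the LLM as a chat message.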

Paper Structure

This paper contains 10 sections, 6 figures, 16 tables, and 2 algorithms.

Figures (6)

  • Figure 1: Prompting VLMs using chat-based LLMs. Similar to how human prompt engineers iteratively test and refine prompts, we employ ChatGPT to continuously optimize prompts for vision-language models (VLMs). Our iterative approach assesses the performance of ChatGPT-generated prompts on a few-shot dataset (highlighted in blue) and provides feedback (marked in violet) to ChatGPT through simple conversations, as depicted in the illustrative figure. This straightforward method delivers state-of-the-art results for one-shot image classification across 11 datasets using CLIP, operated in a black-box manner without accessing model weights, feature embeddings, or output logits. We show that providing both positive (in green) and negative (in red) prompts enhances efficiency. Remarkably, our approach outperforms both white-box methods such as gradient-based continuous prompting (CoOp) and human-engineered prompts in this extremely low-shot scenario. This figure only shows a typical conversation using ChatGPT's web user interface; our code implementation follows this pattern using the ChatGPT API. We detail and ablate the prompts in the appendix.
  • Figure 2: Conversational feedback incorporating both positive and negative prompts leads to improved efficiency. We fix the number of restarts to 20 and iterations to 10, and ablate different numbers of resets on all 11 datasets (left) and ImageNet (right). Notably, our approach using "P+N" (both top-15 and bottom-15 prompts) optimizes faster, within far fewer resets, than "P-Only" (top-30 prompts), resulting in the highest overall performance.
  • Figure 3: Improving text-to-image (T2I) generation using chat-based multimodal LLMs. We apply our framework to optimize prompts for the state-of-the-art black-box generative VLM, DALL-E 3, using the multimodal GPT4-V. For complicated user queries that DALL-E 3 may initially fail to generate, we send the generated image (in violet) along with the current prompt to GPT4-V to ask for feedback on improvements (in red) and then generate a new prompt (in blue). We show that such a simple framework is surprisingly effective at correcting DALL-E 3 mistakes on some challenging Winoground text queries that involve action, logical, and spatial reasoning. We conduct a human evaluation on the quality of generated images (see the human study table) and include the actual prompts in the appendix. We open-source our code at llm-can-optimize-vlm.github.io to facilitate future research on AI-driven content generation.
  • Figure 4: Prompt inversion using chat-based multimodal LLMs. We apply our framework to reverse-engineer the text prompt that generates the same user-queried image. We send the generated image (in violet) along with the original image to GPT4-V to ask for feedback on improvements (in red) and then generate a new prompt (in blue). The final reverse-engineered text prompt allows users to readily perform personalized (customized) generation (see the customization table).
  • Figure 5: Updating initial prompts can be as effective as multi-turn conversation. We ablate different ways of conversing with ChatGPT on all 11 datasets (left) and ImageNet (right). Notably, we find that only updating the top-k and bottom-k prompts (Iterative) is as performant and thus a cheaper alternative, because sending responses back to ChatGPT costs more input tokens. On the other hand, reusing the initial prompts (Non-Iterative) leads to worse overall performance.
  • ...and 1 more figure
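
The "P+N" and "P-Only" feedback variants compared in the Figure 2 caption can be sketched as a selection step over a scored prompt pool. This is a hedged illustration only: `select_feedback` and `format_feedback` are hypothetical helpers, and the message wording is an assumption rather than the prompt text used in the paper.

```python
from typing import Dict, List, Tuple


def select_feedback(scores: Dict[str, float],
                    mode: str = "P+N",
                    k: int = 15) -> Tuple[List[str], List[str]]:
    """Pick the prompts shown to the LLM.

    "P+N":    top-k as positives and bottom-k as negatives.
    "P-Only": top-2k as positives, no negatives (same total budget).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if mode == "P+N":
        k = min(k, len(ranked) // 2)  # keep the two groups disjoint
        return ranked[:k], ranked[len(ranked) - k:]
    if mode == "P-Only":
        return ranked[:2 * k], []
    raise ValueError(f"unknown mode: {mode!r}")


def format_feedback(positives: List[str], negatives: List[str]) -> str:
    """Assemble a plain-text feedback message (wording is hypothetical)."""
    lines = ["Good prompts:"] + [f"- {p}" for p in positives]
    if negatives:
        lines += ["Bad prompts:"] + [f"- {p}" for p in negatives]
    lines.append("Please propose new prompts like the good ones.")
    return "\n".join(lines)


# Usage with made-up scores for 40 placeholder prompts.
scores = {f"prompt {i}": i / 40 for i in range(40)}
pos, neg = select_feedback(scores, mode="P+N", k=15)
print(format_feedback(pos[:2], neg[:2]))
```

Both modes show the LLM the same number of prompts (30 with k=15), so any difference between them comes from whether negative examples are included, not from the size of the feedback.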