Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models

Maurice Diesendruck, Jianzhe Lin, Shima Imani, Gayathri Mahalingam, Mingyang Xu, Jie Zhao

TL;DR

CyclePrompt offers a self-supervised, cycle-consistent approach to refining prompts in multimodal foundation models by coupling a forward and backward mapping via a discriminator to produce in-context hints. Without training data or external tools, it achieves state-of-the-art results on HumanEval among unassisted models and delivers competitive image-captioning performance that surpasses zero-shot baselines on VQAv2 and FigureQA. The method demonstrates that cycle-consistency can provide a powerful supervisory signal for prompt design across code and vision-language tasks, revealing when and why forward, backward, and discriminator components influence gains. This work opens a pathway for purely in-context, self-refining prompting with broad applicability in multimodal AI systems.

Abstract

When LLMs perform zero-shot inference, they typically use a prompt with a task specification and generate a completion. However, no prior work explores the reverse direction: going from completion to task specification. In this paper, we employ both directions to perform cycle-supervised learning entirely in-context. Our goal is to create a forward map f : X -> Y (e.g. image -> generated caption), coupled with a backward map g : Y -> X (e.g. caption -> generated image), to construct a cycle-consistency "loss" (formulated as an update to the prompt) that enforces g(f(X)) ~= X. The technique, called CyclePrompt, uses cycle-consistency as a free supervisory signal to iteratively craft the prompt. Importantly, CyclePrompt reinforces model performance without expensive fine-tuning, without training data, and without the complexity of external environments (e.g. compilers, APIs). We demonstrate CyclePrompt in two domains: code generation and image captioning. Our results on the HumanEval coding benchmark put us in first place on the leaderboard among models that do not rely on extra training data or use of external environments, and third overall. Compared to the GPT4 baseline, we improve accuracy from 80.5% to 87.2%. In the vision-language space, we generate detailed image captions that outperform baseline zero-shot GPT4V captions when tested against natural (VQAv2) and diagrammatic (FigureQA) visual question-answering benchmarks. To the best of our knowledge, this is the first use of self-supervised learning for prompting.
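
The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `cycle_prompt`, its parameters, and the three toy stand-ins (`toy_forward`, `toy_backward`, `toy_discriminate`) are all hypothetical, and the "image" is reduced to a set of text details so the example runs without any model. In a real setting, the forward map, backward map, and discriminator would each be calls to a foundation model.

```python
from typing import Callable, List, Set, Tuple


def cycle_prompt(
    x: Set[str],
    base_prompt: str,
    forward: Callable[[Set[str], str], str],
    backward: Callable[[str], Set[str]],
    discriminate: Callable[[Set[str], Set[str]], str],
    n_cycles: int = 5,
) -> Tuple[str, List[str]]:
    """Refine a prompt with hints derived from the mismatch between x and g(f(x)).

    forward(x, prompt) -> y          plays the role of f (e.g. image -> caption)
    backward(y) -> x_hat             plays the role of g (e.g. caption -> image)
    discriminate(x, x_hat) -> hint   returns "" when no difference is found
    """
    hints: List[str] = []
    y = ""
    for _ in range(n_cycles):
        # The "loss" is expressed as an update to the prompt: append all hints so far.
        prompt = base_prompt + "".join(f"\nHint: {h}" for h in hints)
        y = forward(x, prompt)
        x_hat = backward(y)
        hint = discriminate(x, x_hat)
        if not hint:  # cycle is consistent under this discriminator; stop early
            break
        hints.append(hint)
    return y, hints


# --- Toy stand-ins (assumptions for illustration only, not real models) ---

def toy_forward(x: Set[str], prompt: str) -> str:
    # A lazy "captioner": mentions the first detail, plus any detail named in the prompt.
    details = sorted(x)
    return ", ".join(d for i, d in enumerate(details) if i == 0 or d in prompt)


def toy_backward(y: str) -> Set[str]:
    # A "generator" that reconstructs exactly the details the caption mentions.
    return set(y.split(", ")) if y else set()


def toy_discriminate(x: Set[str], x_hat: Set[str]) -> str:
    # Reports one detail present in the original but missing from the reconstruction.
    missing = sorted(x - x_hat)
    return f"mention {missing[0]}" if missing else ""


image = {"green apples", "white bowl", "dark background"}
caption, hints = cycle_prompt(
    image, "Describe the image.", toy_forward, toy_backward, toy_discriminate
)
print(caption)
print(hints)
```

In this toy run each cycle recovers one missing detail, so the caption becomes more complete as hints accumulate. The sketch only shows the control flow; how the hint is phrased and how many cycles to run are design choices it does not settle.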

Paper Structure

This paper contains 25 sections, 2 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: The AutoEncoder-like structure for CyclePrompt. In the example, the first cycle caption is: "A person in a red jacket with a denim sleeve is holding a hot dog with yellow mustard in a bun. They are wearing a purple scarf and a red hat with a flower on it. There is a glimpse of another person in a red jacket in the background. It appears to be nighttime, and there's a white canopy in the background, possibly indicating an outdoor event.", and the generated hint is: "The person's stance should be slightly bent forward, and the hot dog should be held in both hands with a napkin wrapped around it". The hint refines the prompt to improve the generation of the following cycle.
  • Figure 2: A general flowchart for CyclePrompt, with applications for code generation and image captioning.
  • Figure 3: CyclePrompt inputs and outputs. Final caption: "Several bright green apples with a smooth, shiny texture are placed in a white bowl with a wide rim. The apples are unblemished, except for one in the foreground that has a small, dark indentation near the stem. The bowl sits on a dark surface, and the background is a blurred, dark brown, providing a stark contrast to the vibrant green of the apples. The apples are closely packed together, with one apple prominently in the foreground, slightly obscuring the apples behind it. The lighting is soft and diffused, highlighting the apples' texture and color. The apples have visible white speckles, and the bowl has a subtle shadow cast on the right side. The apples appear more matte than glossy, and the bowl's rim is thick and slightly curved outward."
  • Figure 4: CyclePrompt inputs and outputs. Final caption: "A horizontal bar graph with eight bars in distinct colors, each labeled with a color name on the left side. The bars are arranged from top to bottom in the order of Dark Cyan, Sky Blue, Deep Sky Blue, Chocolate, Deep Pink, Dim Gray, Medium Periwinkle, and Rebecca Purple. The x-axis is labeled 'xaxis label' with a scale from 0 to 100, and the y-axis is labeled 'yaxis label.' The graph has a title at the top that reads 'title.' The bars have varying lengths representing different values on the x-axis, with Dark Cyan being the longest and Rebecca Purple being the shortest. The graph is a clear, 2D representation with no grid lines, and the bars are solid with no patterns or textures. The title, axis labels, and color labels are all clearly legible. The graph background is white, and the bars are not stacked."
  • Figure 5: Comparison of Text-Image-Text (image generation) to Image-Text-Image (image captioning). When the input space is lower-complexity (e.g. text to image, as in the text-image-text "a happy day" panel), the output space can comply with the input while continuously changing. When the input space is higher-complexity (e.g. image to text, as in the image-text-image "a happy day" and 20-cycle apples panels), both spaces are constrained and converge.
  • ...and 2 more figures