Contextual Knowledge Pursuit for Faithful Visual Synthesis

Jinqi Luo, Kwan Ho Ryan Chan, Dimitris Dimos, René Vidal

TL;DR

Contextual Knowledge Pursuit (CKPT) combines external retrieval with LLM parametric knowledge to enhance underspecified prompts. Evaluated on image, 3D rendering, and video generation, it produces faithful and semantically rich content across diverse visual domains, offering a promising data source for zero-shot synthesis and filtered fine-tuning of text-to-vision generative models.

Abstract

Modern text-to-vision generative models often hallucinate when the prompt describing the scene to be generated is underspecified. In large language models (LLMs), a prevalent strategy to reduce hallucinations is to retrieve factual knowledge from an external database. While such retrieval augmentation strategies have great potential to enhance text-to-vision generators, existing static top-K retrieval methods explore the knowledge pool once, missing the broader context necessary for high-quality generation. Furthermore, LLMs internally possess rich world knowledge learned during large-scale training (parametric knowledge) that could mitigate the need for external data retrieval. This paper proposes Contextual Knowledge Pursuit (CKPT), a framework that leverages the complementary strengths of external and parametric knowledge to help generators produce reliable visual content. Instead of the one-time retrieval of facts from an external database to improve a given prompt, CKPT uses (1) an LLM to decide whether to seek external knowledge or to self-elicit descriptions from LLM parametric knowledge, (2) a knowledge pursuit process to contextually seek and sequentially gather the most relevant facts, (3) a knowledge aggregator for prompt enhancement with the gathered fact context, and (4) a filtered fine-tuning objective to improve visual synthesis with richer prompts. We evaluate CKPT across multiple text-driven generative tasks (image, 3D rendering, and video) on datasets of rare objects and daily scenarios. Our results show that CKPT is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising data source for zero-shot synthesis and filtered fine-tuning of text-to-vision generative models.
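
The contrast the abstract draws between static top-K retrieval and sequential, context-aware pursuit can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the token-overlap `relevance` score, the function names, and the toy fact pool are hypothetical stand-ins, not the scoring rule or data used by CKPT. The sketch only shows why re-scoring facts against a growing context can surface a fact that a single pass over the prompt misses.

```python
from typing import List, Set

# Hypothetical stopword list, just enough for the toy example below.
STOPWORDS = {"a", "the", "is", "of", "has"}


def content_tokens(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and drop the stopwords above."""
    return {t for t in text.lower().split() if t not in STOPWORDS}


def relevance(context: str, fact: str) -> int:
    """Toy relevance (an assumption): content tokens shared with the context."""
    return len(content_tokens(context) & content_tokens(fact))


def static_top_k(prompt: str, pool: List[str], k: int) -> List[str]:
    """Baseline: score every fact once against the prompt, keep the best k."""
    return sorted(pool, key=lambda f: relevance(prompt, f), reverse=True)[:k]


def contextual_pursuit(prompt: str, pool: List[str], steps: int) -> List[str]:
    """Greedy pursuit: re-score the remaining facts against the growing context."""
    context = prompt
    remaining = list(pool)
    gathered: List[str] = []
    for _ in range(min(steps, len(remaining))):
        best = max(remaining, key=lambda f: relevance(context, f))
        remaining.remove(best)
        gathered.append(best)
        context = context + " " + best  # the chosen fact updates the context
    return gathered


prompt = "a photo of a kakapo bird"
pool = [
    "a photo studio uses bright lamps",
    "the owl parrot has moss green plumage",
    "the kakapo is a flightless bird called the owl parrot",
]

static_facts = static_top_k(prompt, pool, k=2)
pursued_facts = contextual_pursuit(prompt, pool, steps=2)
print(static_facts)
print(pursued_facts)
```

In this toy pool, the plumage fact shares no words with the prompt, so the one-shot ranking keeps the unrelated "photo studio" sentence instead. Once the first fact introduces "owl parrot" into the context, the pursuit loop selects the plumage fact on its second step.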

Paper Structure

This paper contains 34 sections, 3 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Text-driven generative models often produce (I) unsatisfactory synthesis. Our proposed framework recursively queries facts in an agent-selected knowledge paradigm to achieve (II) faithful multimodal synthesis.
  • Figure 2: The CKPT framework. The user inputs a generic prompt that lacks details, and CKPT decides the knowledge regime. CKPT then recursively picks the most informative fact given the current state of the knowledge context and appends this fact to update the context. The LLM aggregates the final context to produce a faithfully enhanced caption for text-driven generators.
  • Figure 3: Generative captions obtained by prompting GPT-4 and by CKPT. The former yields generic descriptions, while CKPT produces a more detailed, precise, and faithful prompt.
  • Figure 4: Comparison of images generated from original captions and CKPT on the GBIF (upper two rows), MSCOCO (three captions on the lower left), and GUIE LAION-5B (two captions on the lower right) datasets. Blue columns demonstrate results from the external retrieval, while green columns emphasize the parametric elicitation. We outline deficits in bounding boxes (e.g., the missing deer legs, the stuck closed fridge door) and underline notable concepts that the generator should express sufficiently.
  • Figure 5: The CKPT-enhanced captions of Figure 4. We observe that the descriptive captions generated by our framework capture the core semantics and concepts from contextual knowledge queries.
  • ...and 13 more figures