Bridging the Intent Gap: Knowledge-Enhanced Visual Generation

Yi Cheng; Ziwei Xu; Dongyun Lin; Harry Cheng; Yongkang Wong; Ying Sun; Joo Hwee Lim; Mohan Kankanhalli

TL;DR

This work tackles the persistent gap between user intent and generated visuals by introducing a knowledge-enhanced iterative refinement framework. It combines multiple knowledge sources (human insight, pre-trained models, logic rules, and world knowledge) with a knowledge-based feedback module to iteratively align generated content with user intentions. The framework comprises a content generation baseline, which converts user input into structured representations for the generative model, and a feedback-driven loop that updates prompts, representations, or models. It is formalized through a generative model $M: \mathcal{X} \times \mathcal{V} \rightarrow \mathcal{Y}$ with $\hat{y}^{(k)} = M(x, v_f^{(k-1)})$ and $v_f^{(k-1)} = h(x, \hat{y}^{(k-1)})$, where $x$ is the user input, $\hat{y}^{(k)}$ is the content generated at iteration $k$, and $v_f^{(k-1)}$ is the feedback that the feedback module $h$ derives from the previous output. Preliminary diffusion-based experiments show substantial improvements over training-free baselines and competitive results with training-based approaches, highlighting the potential of knowledge-guided refinement for intention-aligned visual content generation and suggesting paths toward automatic knowledge integration and multi-modal extension.
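The recursion $\hat{y}^{(k)} = M(x, v_f^{(k-1)})$, $v_f^{(k-1)} = h(x, \hat{y}^{(k-1)})$ can be read as a simple loop. The Python sketch below is only an illustration of that reading, not the authors' implementation: the names `refine`, `toy_model`, and `toy_feedback` are hypothetical, the initial round is assumed to run with no feedback (`None`), and stopping early when the feedback module has nothing to report is likewise an assumption.

```python
from typing import Callable, List, Optional, TypeVar

X = TypeVar("X")  # user input x (e.g. a text prompt)
V = TypeVar("V")  # feedback v_f
Y = TypeVar("Y")  # generated content y-hat


def refine(
    model: Callable[[X, Optional[V]], Y],        # plays the role of M
    feedback_fn: Callable[[X, Y], Optional[V]],  # plays the role of h
    x: X,
    num_rounds: int,
) -> Y:
    """Run up to `num_rounds` feedback-driven refinement rounds."""
    # Assumption: the initial generation receives no feedback.
    y = model(x, None)
    for _ in range(num_rounds):
        v = feedback_fn(x, y)  # v_f^(k-1) = h(x, y^(k-1))
        if v is None:          # assumption: no feedback means aligned, so stop
            break
        y = model(x, v)        # y^(k) = M(x, v_f^(k-1))
    return y


# Toy stand-ins (hypothetical): "content" is just a list of depicted concepts.
def toy_model(x: str, v: Optional[List[str]]) -> List[str]:
    # Unaided, the toy generator captures only the first two prompt words;
    # anything named in the feedback is added explicitly.
    return x.split()[:2] + (v or [])


def toy_feedback(x: str, y: List[str]) -> Optional[List[str]]:
    # Report prompt words that are missing from the generated content.
    missing = [w for w in x.split() if w not in y]
    return missing or None


if __name__ == "__main__":
    print(refine(toy_model, toy_feedback, "girl dog left park", num_rounds=3))
```

In this toy setting the feedback only patches the next call to the model. In the framework described here, feedback may instead update the prompt, the structured representation, or the model itself, which this sketch does not attempt to reproduce.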

Abstract

For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not fully captured by input prompts. The absence of such details makes it challenging for generative models to accurately reflect the intended meaning, leading to a mismatch between the desired and generated output. Second, generative models trained on visual-label pairs lack the comprehensive knowledge to accurately represent all aspects of the input data in their generated outputs. To address these challenges, we propose a knowledge-enhanced iterative refinement framework for visual content generation. We begin by analyzing and identifying the key challenges faced by existing generative models. Then, we introduce various knowledge sources, including human insights, pre-trained models, logic rules, and world knowledge, which can be leveraged to address these challenges. Furthermore, we propose a novel visual generation framework that incorporates a knowledge-based feedback module to iteratively refine the generation process. This module gradually improves the alignment between the generated content and user intentions. We demonstrate the efficacy of the proposed framework through preliminary results, highlighting the potential of knowledge-enhanced generative models for intention-aligned content generation.

Paper Structure (29 sections, 1 equation, 6 figures, 2 tables)

Introduction
Challenges and Knowledge
Challenges
Sufficient Input
Factual Integrity
Semantic Completeness
Fair Representation
Visual Fidelity
Knowledge Sources
Human Insight
Pre-trained Models
Logic Rules
World Knowledge
Knowledge-Enhanced Framework
Overall Architecture
...and 14 more sections

Figures (6)

Figure 1: Our proposed knowledge-enhanced framework for visual content generation. It introduces a feedback loop to the conventional single-round generation process (yellow region). Specifically, a knowledge-based feedback module generates diverse feedback based on the text input and generated content. This feedback is leveraged to iteratively refine the generative process, improving output quality and alignment with user intentions. The conversion of feedback into specific update procedures can be performed either by a human user or automated using pre-defined rules, allowing for varying levels of user involvement.
Figure 2: Summary of challenges faced by generative models in visual content generation. Orange denotes challenges related to the complexity of user intention, while green denotes challenges related to the inability of generative models to accurately represent user input. Each challenge is described along with the knowledge sources that can help address it. Icons in the figure mark the four knowledge sources: human insight, pre-trained models, logic rules, and world knowledge.
Figure 3: Examples illustrating the challenges faced by generative models. (a) Insufficient input: the prompt lacks details on the crowd, style, and other characteristics of the street. (b) Factual integrity error: an anachronistic depiction of Singapore in 1966 featuring modern skyscrapers. (c) Semantic completeness error: attributes of different objects are mixed. (d) Unfair representation of gender: engineers are predominantly depicted as male. (e) Visual fidelity concern: artifacts in buildings, such as skewed lines.
Figure 4: Overview of the framework. User intention is expressed as user input (gray region). The Content Generation Baseline then converts user input into structured representations, which are fed into the generative model to produce content. The Knowledge-Based Feedback Module evaluates the alignment between user input and generated content using knowledge from multiple sources. The feedback is used to enhance the generation process by updating essential components.
Figure 5: An example illustrating the proposed framework for image generation. In the first iteration, the generated content does not meet user intentions due to a vague input description. Human feedback adds details (spatial relation between girl and dog) to the input text, and the image is regenerated. In the second iteration, feedback from foundation models identifies semantic incompleteness (two dogs instead of one). The structured representation is updated with object spatial location using bounding boxes (bbox). After refinement, the model generates an image aligned with the user's intention based on the feedback.
...and 1 more figures
