Generated Contents Enrichment

Mahdi Naseri, Jiayan Qiu, Zhou Wang

TL;DR

A deep end-to-end adversarial method for Generated Contents Enrichment that explicitly explores semantics and inter-semantic relationships during enrichment, generating content that is visually realistic, structurally coherent, and semantically abundant.

Abstract

In this paper, we investigate a novel artificial intelligence generation task termed Generated Contents Enrichment (GCE). Conventional AI content generation produces visually realistic content by implicitly enriching the given textual description based on limited semantic descriptions. Unlike this traditional task, our proposed GCE strives to perform content enrichment explicitly in both the visual and textual domains. The goal is to generate content that is visually realistic, structurally coherent, and semantically abundant. To tackle GCE, we propose a deep end-to-end adversarial method that explicitly explores semantics and inter-semantic relationships during the enrichment process. Our approach first models the input description as a scene graph, where nodes represent objects and edges capture inter-object relationships. We then adopt Graph Convolutional Networks on top of the input scene description to predict additional enriching objects and their relationships with the existing ones. Finally, the enriched description is passed to an image synthesis model to generate the corresponding visual content. Experiments conducted on the Visual Genome dataset demonstrate the effectiveness of our method, producing promising and visually plausible results.
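To make the scene-graph representation described in the abstract concrete, here is a minimal sketch in Python. It assumes a plain data structure in which objects are nodes and relationships are (subject, predicate, object) triples. The names `SceneGraph` and `enrich` are hypothetical and do not come from the paper, and the enriching object is supplied by hand here, whereas the paper predicts it with Graph Convolutional Networks.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Objects are nodes; (subject_idx, predicate, object_idx) triples are edges."""
    objects: List[str] = field(default_factory=list)
    triples: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, name: str) -> int:
        """Append a node and return its index."""
        self.objects.append(name)
        return len(self.objects) - 1

    def add_relationship(self, subj: int, predicate: str, obj: int) -> None:
        """Append a directed edge between two existing nodes."""
        n = len(self.objects)
        if not (0 <= subj < n and 0 <= obj < n):
            raise IndexError("relationship refers to an unknown object")
        self.triples.append((subj, predicate, obj))

    def describe(self) -> List[str]:
        """Render every edge as a short textual phrase."""
        return [f"{self.objects[s]} {p} {self.objects[o]}" for s, p, o in self.triples]


def enrich(graph: SceneGraph, new_object: str, predicate: str, existing_idx: int) -> SceneGraph:
    """Return a copy of `graph` with one extra object linked to an existing one.

    Assumption: the new object and predicate are given by the caller; in the
    paper they are predicted by the model.
    """
    enriched = SceneGraph(list(graph.objects), list(graph.triples))
    new_idx = enriched.add_object(new_object)
    enriched.add_relationship(new_idx, predicate, existing_idx)
    return enriched
```

Returning a copy rather than mutating the input keeps both the original and the enriched description available, which matches the setup where simple and enriched images are synthesized from the two graphs respectively.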

Paper Structure (36 sections, 17 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Initially, the input textual description (a) is represented as a scene graph (b). Subsequently, we enrich the scene description (b) iteratively by appending additional objects and their relationships with existing scene elements. The enriching content (c) should preserve the same essential scene characteristics as the original input (b). This enrichment is executed utilizing our proposed end-to-end adversarial graph convolutional framework. Both the input (b) and enriched (c) scene graphs are then employed to synthesize simple (d) and enriched (e) images, respectively. In comparison to the simple image (d), the enriched one (e) not only reflects the essence of the initial input description (b) but also integrates more relevant intricate details akin to those present in real-world images (f) found in the Visual Genome dataset (Krishna et al., 2017).
  • Figure 2: High-level overview of our end-to-end Generated Contents Enrichment framework during the training phase. In Stage 1, the input scene graph (SG) is fed to the Scene Graph Enricher $G_{sg}$ to produce an enriched SG. In addition, a pair of local and global discriminators in the SG Critic $D_{sg}$ are jointly trained to differentiate between original and enriched SGs. These discriminators guide the enrichment process toward realistic, structurally coherent, and semantically meaningful scenes. In Stage 2, the Image Synthesizer $G_{im}$ uses the resulting enriched SG to generate an image. In Stage 3, essential visual and textual scene characteristics are extracted by the Visual Scene Characterizer $S_{cf}$ and the Image-Text Aligner $M_{im\_sg}$. These two components ensure that the enriched image appropriately reflects the original description's inherent characteristics.
  • Figure 3: Qualitative Results. Samples of scene graphs from the Visual Genome test split as the input descriptions are enriched, along with their synthesized images featuring richer content. The simple image is generated directly from the input description, and the enriched images are produced from their corresponding enriched descriptions. The input scene graphs (a) and the enriched scene graphs generated by our model (c) are represented as textual descriptions.
  • Figure 4: Graph Convolutional Network (GCN) and its building block GConv are employed in Stage 1: Generative Adversarial SG Enrichment.
  • Figure 5: The Enriching Edge Detector receives the nodes of a graph but not their edges. It also accepts hidden object vectors from the Enriching Object GCN as input. After concatenation, each node is fed to two neural networks with the same architecture but separate sets of weights to form hidden vectors for the subjects and objects of a relationship. This step maps the nodes into another space where cosine similarity indicates how likely an edge is to exist between two nodes. The resulting subject and object vectors therefore form a score matrix, computed as a matrix multiplication.
  • ...and 2 more figures
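
The score-matrix computation described in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each of the two networks is reduced to a single linear map (`w_subj`, `w_obj`), the rows are L2-normalized so that the matrix product yields cosine similarities, and all function and parameter names are hypothetical. Pure Python lists are used to keep the example dependency-free.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def l2_normalize_rows(m: Matrix, eps: float = 1e-12) -> Matrix:
    """Scale each row to unit length so dot products become cosine similarities."""
    out = []
    for row in m:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / max(norm, eps) for v in row])
    return out


def edge_scores(node_feats: Matrix, hidden_feats: Matrix,
                w_subj: Matrix, w_obj: Matrix) -> Matrix:
    """Return an n x n matrix whose entry (i, j) scores an edge from node i to node j.

    Assumption: each "network" is a single linear layer; the paper uses two
    networks with the same architecture but separate weights.
    """
    # Concatenate each node's features with its hidden object vector.
    x = [n + h for n, h in zip(node_feats, hidden_feats)]
    subj = l2_normalize_rows(matmul(x, w_subj))  # subject-role embeddings
    obj = l2_normalize_rows(matmul(x, w_obj))    # object-role embeddings
    # One matrix multiplication gives all pairwise cosine similarities.
    return matmul(subj, transpose(obj))
```

Because the subject and object projections use separate weights, the score matrix is generally not symmetric, which suits directed relationships such as "sheep standing on grass".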