Multi-Modal Experience Inspired AI Creation

Qian Cao, Xu Chen, Ruihua Song, Hao Jiang, Guang Yang, Zhao Cao

TL;DR

This work addresses AI-driven text creation guided by sequential multi-modal experiences, introducing a multi-channel sequence-to-sequence framework with a spanning-influence mechanism and a 2D multi-modal fusion network. It couples topic-conditioned decoding with an Experience Enhanced Sentence Decoder and a curriculum negative sampling strategy, enabling asynchronous input-output associations and improved optimization. A new e-passage dataset is constructed, and experiments show that the proposed MMTG model outperforms baselines on automatic metrics and human judgments, with ablations highlighting the contributions of the sequential attention and the curriculum strategy. By jointly modeling visual and textual cues across time, the approach moves toward more realistic, human-like creative AI that generates coherent, meaningful text, with applications such as lyric generation and visual storytelling.

Abstract

AI creation, such as poem or lyrics generation, has attracted increasing attention from both industry and academia, with many promising models proposed in the past few years. Existing methods usually estimate the outputs based on single and independent visual or textual information. In reality, however, humans usually create according to their experiences, which may involve different modalities and be sequentially correlated. To model such human capabilities, in this paper we define and solve a novel AI creation problem based on human experiences. More specifically, we study how to generate texts based on sequential multi-modal information. Compared with previous work, this task is much more difficult because the model has to understand and adapt the semantics across different modalities and effectively convert them into the output in a sequential manner. To alleviate these difficulties, we first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. For more effective optimization, we then propose a curriculum negative sampling strategy tailored to the sequential inputs. To benchmark this problem and demonstrate the effectiveness of our model, we manually label a new multi-modal experience dataset. With this dataset, we conduct extensive experiments comparing our model with a series of representative baselines, and demonstrate significant improvements on both automatic and human-centered metrics. The code and data are available at: https://github.com/Aman-4-Real/MMTG
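
To make the idea of curriculum negative sampling over sequential inputs more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a negative sample is formed by replacing some positions of an experience sequence with items drawn from an unrelated pool, and that the number of replaced positions grows linearly with training progress. The function name `corrupt_sequence`, the parameter `max_negatives`, and the linear schedule are all hypothetical choices made for illustration; the released code at the URL above should be consulted for the actual strategy.

```python
import random
from typing import List, Sequence, Tuple


def corrupt_sequence(
    experiences: Sequence[str],
    negative_pool: Sequence[str],
    step: int,
    total_steps: int,
    max_negatives: int,
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Replace some experiences with negatives (hypothetical schedule).

    The number of replaced positions k grows linearly from 0 to
    max_negatives as step goes from 0 to total_steps. This schedule is an
    assumption for illustration, not the schedule used in the paper.
    Returns the corrupted sequence and per-position labels
    (1 = original item kept, 0 = replaced by a negative).
    """
    clipped_step = min(max(step, 0), total_steps)
    k = (clipped_step * max_negatives) // max(total_steps, 1)
    k = min(k, len(experiences))

    # Choose k distinct positions to corrupt.
    positions = rng.sample(range(len(experiences)), k)

    corrupted = list(experiences)
    labels = [1] * len(experiences)
    for p in positions:
        corrupted[p] = rng.choice(list(negative_pool))
        labels[p] = 0
    return corrupted, labels


if __name__ == "__main__":
    rng = random.Random(0)
    seq = ["e1", "e2", "e3", "e4", "e5"]   # placeholder experience ids
    pool = ["n1", "n2", "n3"]              # placeholder negatives
    for step in (0, 50, 100):
        print(step, corrupt_sequence(seq, pool, step, 100, 3, rng))
```

In this sketch the per-position labels could serve as supervision for a model that must tell relevant experiences from irrelevant ones; whether and how MMTG uses such labels is not stated in the abstract.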

Paper Structure (23 sections, 13 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: A toy example of the human creation process. The inputs and outputs correspond sequentially but loosely; that is, each input may influence multiple outputs.
  • Figure 2: The framework of our proposed MMTG model. Experiences are shown in image and text sequences. An image corresponds to its text at the same time step. The modules of Multi-Channel Sequence Processor, Spanning Influence Modeling, Multi-Modal Fusion Network, and Experience Enhanced Sentence Decoder are presented from left to right.
  • Figure 3: Results of ablation study on different variants. "$\alpha$ attn." and "t-prt." refer to $\alpha$-attention and t-prompt.
  • Figure 4: A case study showing a topic, five image-text pairs as experiences, the ground truth, and the lyrics generated by our MMTG model.
  • Figure 5: Results of ablation study on different training strategies. "CL." and "Neg." are short for curriculum learning and negative samples respectively.

Theorems & Definitions (2)

  • Remark
  • Remark