Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation

Zhiwei Zhang, Yuliang Liu

TL;DR

This work tackles accountability in multimodal generation by introducing Accountable Text-based Visual Re-creation (ATVC): given a visual input V and a text query T, a model must produce a re-created image M and a language-based explanation A, and must refuse instructions that cannot be executed or are prohibited. The authors build two datasets, CLEVR-ATVC (synthetic, ~620K training pairs) and Fruit-ATVC (real, ~27.5K training pairs), each paired with predefined rules that the model must follow to give a yes/no answer and to explain any refusal. A two-stage training pipeline is proposed: (i) a discrete variational auto-encoder (dVAE) image encoder that compresses each image into $32 \times 32 = 1024$ latent tokens drawn from a codebook of size $8192$, and (ii) an auto-regressive transformer that consumes a token sequence $T_{seq}$ formed from T, V, M, and A and generates both M and A, optimizing a transformer loss while not backpropagating through cannot/forbidden cases. In the experiments, the method produces both re-created images and explanatory feedback, and VQVAE outperforms VQGAN in preserving object properties. The results suggest that accountable textual-visual interaction is feasible, with potential safety benefits for deployment.
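The token stream described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid size (32 × 32) and codebook size (8192) come from the summary, while `TEXT_VOCAB_SIZE`, the `Sample` class, the helper names, and the exact loss-masking rule are hypothetical. In particular, the sketch reads "not backpropagating through cannot/forbidden cases" as "positions of M contribute no loss when the instruction is refused", which is only one possible interpretation.

```python
from dataclasses import dataclass
from typing import List, Tuple

GRID = 32                       # from the summary: 32 x 32 latent grid
IMAGE_TOKENS = GRID * GRID      # 1024 tokens per image
CODEBOOK_SIZE = 8192            # from the summary: dVAE codebook size
TEXT_VOCAB_SIZE = 16384         # ASSUMPTION: not stated in the source


@dataclass
class Sample:
    """One training example (hypothetical container)."""
    query_ids: List[int]     # T: text query, as text-token ids
    source_codes: List[int]  # V: dVAE codes of the input image
    target_codes: List[int]  # M: dVAE codes of the re-created image
    answer_ids: List[int]    # A: textual feedback, as text-token ids
    executable: bool         # False for "cannot" / "forbidden" instructions


def shift_image_codes(codes: List[int]) -> List[int]:
    """Map codebook indices into a shared vocabulary, after the text ids."""
    if len(codes) != IMAGE_TOKENS:
        raise ValueError(f"expected {IMAGE_TOKENS} codes, got {len(codes)}")
    if any(c < 0 or c >= CODEBOOK_SIZE for c in codes):
        raise ValueError("image code outside the codebook range")
    return [TEXT_VOCAB_SIZE + c for c in codes]


def build_sequence(sample: Sample) -> Tuple[List[int], List[int]]:
    """Concatenate T, V, M, A into one stream and build a per-token loss mask.

    ASSUMPTION: T and V act as conditioning only (mask 0), A always
    contributes to the loss, and M contributes only when the instruction
    is executable.
    """
    v = shift_image_codes(sample.source_codes)
    m = shift_image_codes(sample.target_codes)
    tokens = sample.query_ids + v + m + sample.answer_ids

    m_weight = 1 if sample.executable else 0
    mask = (
        [0] * len(sample.query_ids)
        + [0] * len(v)
        + [m_weight] * len(m)
        + [1] * len(sample.answer_ids)
    )
    return tokens, mask
```

With a three-token query and a two-token answer, the stream has 3 + 1024 + 1024 + 2 = 2053 tokens. Under the masking assumption above, 1026 of them carry loss for an executable instruction and only the 2 answer tokens do for a refused one.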

Abstract

The recent success of ChatGPT and GPT-4 has drawn widespread attention to multimodal dialogue systems. However, the academic community lacks datasets that can effectively evaluate the multimodal generation capabilities of Visual Language Models (VLMs) in textual-visual chat tasks. In this paper, we address this gap by introducing two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually photographed Fruit-ATVC dataset (50K). Both datasets include visual and text-based inputs and outputs. Furthermore, to make multimodal systems accountable when rejecting human requests, as language-based ChatGPT conversations are, we introduce specific rules as supervisory signals within the datasets. This allows the trained VLM to give a yes or no answer after visual and textual reasoning, accompanied by a language explanation of why the given human instruction cannot be executed. Our proposed method is a two-stage training procedure in which the image auto-encoder and the auto-regressive transformer are both trained from scratch. The first stage employs a discrete variational autoencoder (dVAE) to compress each image into a short sequence of tokens, which are then combined with text tokens into a single data stream. In the second stage, this stream is fed into the decoder-based transformer to generate visual re-creations and textual feedback. We conduct comprehensive analyses of the experimental results, focusing on re-created image quality, answer accuracy, and the model's behavior when faced with uncertainty and imperfect user queries. Through our explorations and findings, we aim to contribute valuable insights into the accountability of textual-visual generative models.

Paper Structure (28 sections, 5 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: The left figure shows that a multimodal generative model can simultaneously re-create the visual input and give textual feedback when faced with human instructions; in particular, it can reject some commands. The right figure illustrates the overall framework of our method. The model is required to generate a re-created image (M) and textual feedback (A) conditioned on the visual input (V) and the text-based user query (T). A language-based explanation is also given for instructions that cannot be executed and for prohibited instructions.
  • Figure 2: The mention frequency of different fruits and the annotation interface for the Fruit-ATVC dataset.
  • Figure 3: Visualization of the main format of the data annotation file. The details can be found in the appendix.
  • Figure 4: Image re-creation results on the CLEVR-ATVC sub-dataset using imperfect queries. The red text marks the short or imperfect query.
  • Figure 5: Analysis of the results of the VQGAN-based image encoder. VQGAN frequently changes the colors of the objects, which degrades performance.
  • ...and 8 more figures