Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
Zhiwei Zhang, Yuliang Liu
TL;DR
This work tackles accountability in multimodal generation by introducing Accountable Text-based Visual Re-creation (ATVC): given a visual input V and a text query T, a model must produce a re-created image M and a language-based explanation A, and must refuse instructions that cannot be fulfilled or are prohibited. The authors build two datasets, CLEVR-ATVC (synthetic, ~620K training pairs) and Fruit-ATVC (real, ~27.5K training pairs), each paired with predefined rules that the model must follow when answering yes or no and when explaining a refusal. Training proceeds in two stages: (i) a discrete variational autoencoder (dVAE) image encoder compresses each image into $32 \times 32 = 1024$ latent tokens drawn from a codebook of size $8192$, and (ii) an auto-regressive transformer consumes a token sequence $T_{seq}$ formed from T, V, M, and A and generates both M and A, with the transformer loss not backpropagated through cannot/forbidden cases. In the experiments, the method produces meaningful re-creations and explanatory feedback, and VQVAE preserves object properties better than VQGAN. The results indicate that accountable multimodal interaction is feasible, with potential safety and deployment benefits.
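To make the token-sequence construction concrete, here is a minimal sketch of how T, V, M, and A could be packed into one stream. Several details are assumptions rather than facts from the paper: the text vocabulary size, the trick of shifting image token ids past the text vocabulary so both share one id space, and the reading that "not backpropagating through cannot/forbidden cases" means excluding the M positions from the loss when the instruction is refused. The function name `build_sequence` is hypothetical.

```python
from typing import List, Tuple

IMAGE_TOKENS = 32 * 32    # 1024 latent tokens per image (from the summary)
CODEBOOK_SIZE = 8192      # image codebook size (from the summary)
TEXT_VOCAB_SIZE = 1000    # hypothetical; the summary does not state it


def build_sequence(
    t: List[int], v: List[int], m: List[int], a: List[int], refused: bool
) -> Tuple[List[int], List[int]]:
    """Concatenate T, V, M, A into one stream and return (tokens, loss_mask).

    Assumption: image ids are offset by TEXT_VOCAB_SIZE so text and image
    tokens never collide. Assumption: for refused (cannot/forbidden)
    queries, the M positions get mask 0 and contribute no loss.
    """
    for name, img in (("V", v), ("M", m)):
        if len(img) != IMAGE_TOKENS:
            raise ValueError(f"{name} must contain {IMAGE_TOKENS} tokens")
        if any(not 0 <= tok < CODEBOOK_SIZE for tok in img):
            raise ValueError(f"{name} has a token outside the codebook")

    def shift(img: List[int]) -> List[int]:
        return [TEXT_VOCAB_SIZE + tok for tok in img]

    tokens = t + shift(v) + shift(m) + a
    m_weight = 0 if refused else 1
    loss_mask = [1] * len(t) + [1] * len(v) + [m_weight] * len(m) + [1] * len(a)
    return tokens, loss_mask
```

With a 3-token query and a 2-token answer, the stream has 3 + 1024 + 1024 + 2 tokens. For a refused query, M would need some placeholder content, and the mask then removes those 1024 positions from the loss.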
Abstract
The recent success of ChatGPT and GPT-4 has drawn widespread attention to multimodal dialogue systems. However, there is a lack of datasets in the academic community that can effectively evaluate the multimodal generation capabilities of Visual Language Models (VLMs) in textual-visual chat tasks. In this paper, we address this gap by introducing two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually photographed Fruit-ATVC dataset (50K). These datasets incorporate both visual and text-based inputs and outputs. Furthermore, to make multimodal systems accountable when rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets. This allows the trained VLM to provide a yes or no answer after engaging in visual and textual reasoning, accompanied by a language explanation of why the given human instruction cannot be executed. Our proposed method involves a two-stage training procedure, which includes training the image auto-encoder and the auto-regressive transformer from scratch. The first stage employs a discrete variational autoencoder (dVAE) to compress each image into a compact sequence of tokens, which are then combined with text tokens into a single data stream. In the second stage, this stream is fed into the decoder-based transformer to generate visual re-creations and textual feedback. We conduct comprehensive analyses of experimental results, focusing on re-created image quality, answer accuracy, and the model's behavior when faced with uncertainty and imperfect user queries. Through our explorations and findings, we aim to contribute valuable insights into the accountability of textual-visual generative models.
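The first-stage tokenization can be illustrated with a small sketch. It assumes the encoder emits one logit per codebook entry at each grid position and that, at inference, each position takes the index of its largest logit, as in DALL-E-style dVAEs. A VQVAE would instead pick the nearest codebook vector, and the relaxation used during training is omitted here. The function name `logits_to_tokens` is hypothetical, and the toy grid is far smaller than the 32 x 32 grid with 8192 entries described in the paper.

```python
from typing import List, Sequence


def logits_to_tokens(logits: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Turn an H x W x K grid of encoder logits into H*W discrete tokens.

    Each position is assigned the codebook index with the largest logit
    (ties resolve to the lowest index). Tokens are flattened row by row.
    """
    tokens: List[int] = []
    for row in logits:
        for cell in row:
            best = max(range(len(cell)), key=cell.__getitem__)
            tokens.append(best)
    return tokens


# Toy 2 x 2 grid with a 3-entry codebook (illustrative values only).
toy_logits = [
    [[0.1, 0.9, 0.0], [2.0, 1.0, 0.5]],
    [[0.0, 0.0, 3.0], [1.0, 1.0, 0.2]],
]
toy_tokens = logits_to_tokens(toy_logits)  # [1, 0, 2, 0]
```

The resulting flat list is what would be concatenated with the text tokens to form the single data stream consumed by the transformer.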
