CoCoG: Controllable Visual Stimuli Generation based on Human Concept Representations

Chen Wei; Jiachen Zou; Dietmar Heinke; Quanying Liu

TL;DR

Experiments with CoCoG indicate that its reliable concept embeddings predict human behavior with 64.07% accuracy on the THINGS-similarity dataset, and that CoCoG can generate diverse stimuli by controlling concepts.

Abstract

A central question for cognitive science is how humans process visual objects, i.e., how to uncover the low-dimensional human concept representation space from high-dimensional visual stimuli. Generating visual stimuli with controlled concepts is key to answering it. However, there are currently no generative models in AI that solve this problem. Here, we present the Concept based Controllable Generation (CoCoG) framework. CoCoG consists of two components: a simple yet efficient AI agent that extracts interpretable concepts and predicts human decision-making in visual similarity judgment tasks, and a conditional generation model that generates visual stimuli given the concepts. We quantify the performance of CoCoG from two aspects: human behavior prediction accuracy and controllable generation ability. The experiments with CoCoG indicate that 1) the reliable concept embeddings in CoCoG predict human behavior with 64.07% accuracy on the THINGS-similarity dataset; 2) CoCoG can generate diverse objects through the control of concepts; 3) CoCoG can manipulate human similarity judgment behavior by intervening on key concepts. CoCoG offers visual objects with controllable concepts to advance our understanding of causality in human cognition. The code of CoCoG is available at https://github.com/ncclab-sustech/CoCoG.

Paper Structure (42 sections, 13 equations, 11 figures, 2 tables)

Introduction
Method
Concept encoder for embedding low-dimensional concepts
Two-stage concept decoder for controllable visual stimuli generation
Stage I - Prior diffusion
Stage II - CLIP guidance generation
CLIP embedding as an intermediate variable
Model Validation
Concept encoder can predict and explain human behaviors
Concept decoder can generate visual objects consistent with concept embedding
CoCoG for studying counterfactual explanations of human behaviors
Flexible controlling of generated objects with text prompts
Manipulating the similarity judgment decisions by intervening the key concepts
Related Works
Concept Embeddings
...and 27 more sections

Figures (11)

Figure 1: Motivations of our work.
Figure 2: The framework of CoCoG. (a) The concept encoder for learning concept embeddings using a similarity judgment behavior dataset. Visual objects are processed through the CLIP image encoder to obtain CLIP image embeddings, which are then passed through a learnable concept projector to obtain concept embeddings. Similarity judgment behaviors are then predicted by computing the similarity between the concept embeddings of different objects. (b) Stage I of the concept decoder, the prior diffusion. The concept embedding is first determined from the desired judgment behavior (e.g., here by modifying the concept “colorful”); a diffusion model conditioned on the concept embedding is then trained to generate the corresponding CLIP embedding. (c) Stage II of the concept decoder, the CLIP guided generation. It uses the CLIP embedding as a condition to guide the pre-trained image diffusion generation model to generate VAE latents, which are then processed through the VAE decoder to produce the generated visual object.
Figure 3: The performance of the concept encoder in predicting and explaining human behavior. (a) Our model's prediction accuracy for similarity judgment behavior is 64.07%, exceeding the 63.27% of the previous SOTA model VICE (blue dashed line) and only slightly below the noise ceiling (gray dashed line). The Pearson correlation coefficient between the similarity of visual objects predicted by our model and by VICE is 0.94; (b) Example visual objects and their concept embeddings, with dashed lines representing the 90th percentile of activated concepts; (c) Example visual objects with significant activation on the concepts Home tools and Baked food, respectively.
Figure 4: The visual objects generated by controlling the concept embeddings.
Figure 5: Measurements of the performance of generated visual objects. (a) The similarity between random visual objects and the target concept embeddings (blue), and the similarity between generated visual objects and the target concept embeddings (orange); (b) The similarity between visual objects and target concept embeddings as the guidance scale changes (blue), and the diversity of visual objects as the guidance scale changes (orange).
...and 6 more figures
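
The concept encoder described in the Figure 2(a) caption can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names, the random stand-in for CLIP image embeddings, the embedding sizes, the ReLU used to keep concept activations non-negative, the dot-product similarity, and the odd-one-out triplet format with a softmax over pairwise similarities.

```python
import numpy as np

CLIP_DIM = 16     # hypothetical size of a CLIP image embedding
N_CONCEPTS = 8    # hypothetical number of concept dimensions


def concept_embedding(clip_emb, projector):
    """Map CLIP embeddings (n, CLIP_DIM) to concept embeddings (n, N_CONCEPTS).

    The ReLU is an assumption; it keeps every concept activation non-negative.
    """
    return np.maximum(clip_emb @ projector, 0.0)


def odd_one_out_probs(concepts):
    """Given concept embeddings for three objects, shape (3, n_concepts),
    return the probability that each object is the odd one out.

    Object i is the odd one out when the other two objects form the most
    similar pair, so the logit for object i is the similarity of that pair.
    """
    sim = concepts @ concepts.T                       # pairwise dot products
    logits = np.array([sim[1, 2], sim[0, 2], sim[0, 1]])
    logits = logits - logits.max()                    # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


# Stand-in data: random vectors in place of real CLIP image embeddings.
rng = np.random.default_rng(0)
clip_triplet = rng.normal(size=(3, CLIP_DIM))
projector = rng.normal(size=(CLIP_DIM, N_CONCEPTS))  # learnable in practice

concepts = concept_embedding(clip_triplet, projector)
probs = odd_one_out_probs(concepts)
print("odd-one-out probabilities:", probs)
```

In a trained model the projector would be fitted so that these probabilities match human choices; here it is random, so the printed values only demonstrate the data flow from image embedding to predicted judgment.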
