CoCoG: Controllable Visual Stimuli Generation based on Human Concept Representations

Chen Wei, Jiachen Zou, Dietmar Heinke, Quanying Liu

TL;DR

Experiments indicate that CoCoG's reliable concept embeddings predict human behavior with 64.07% accuracy on the THINGS-similarity dataset, and that CoCoG can generate diverse stimuli by controlling concepts.

Abstract

A central question for cognitive science is to understand how humans process visual objects, i.e., to uncover the low-dimensional human concept representation space from high-dimensional visual stimuli. Generating visual stimuli with controlled concepts is key to answering it. However, there are currently no generative models in AI that solve this problem. Here, we present the Concept-based Controllable Generation (CoCoG) framework. CoCoG consists of two components: a simple yet efficient AI agent for extracting interpretable concepts and predicting human decision-making in visual similarity judgment tasks, and a conditional generation model for generating visual stimuli given the concepts. We quantify the performance of CoCoG from two aspects: human behavior prediction accuracy and controllable generation ability. The experiments with CoCoG indicate that 1) the reliable concept embeddings in CoCoG predict human behavior with 64.07% accuracy on the THINGS-similarity dataset; 2) CoCoG can generate diverse objects through the control of concepts; 3) CoCoG can manipulate human similarity judgment behavior by intervening on key concepts. CoCoG offers visual objects with controllable concepts to advance our understanding of causality in human cognition. The code of CoCoG is available at https://github.com/ncclab-sustech/CoCoG.

Paper Structure (42 sections, 13 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Motivations of our work.
  • Figure 2: The framework of CoCoG. (a) The concept encoder for learning concept embeddings using a similarity judgment behavior dataset. Visual objects are processed through the CLIP image encoder to obtain CLIP image embeddings, which are then passed through a learnable concept projector to obtain concept embeddings. Similarity judgment behaviors can then be predicted by computing the similarity between the concept embeddings of different objects. (b) Stage I of the concept decoder, the prior diffusion. The concept embedding is first determined from the desired judgment behavior (e.g., here modifying the concept “colorful”). Then, a diffusion model conditioned on the concept embedding is trained to generate the corresponding CLIP embedding. (c) Stage II of the concept decoder, the CLIP-guided generation. It uses the CLIP embedding as a condition to guide the pre-trained image diffusion model to generate VAE latents, which are then processed through the VAE decoder to produce the generated visual object.
  • Figure 3: The performance of the concept encoder in predicting and explaining human behavior. (a) Our model's prediction accuracy for similarity judgment behavior is 64.07%, exceeding the 63.27% of the previous SOTA model VICE (blue dashed line) and only slightly below the noise ceiling (gray dashed line). The Pearson correlation coefficient between the visual-object similarities predicted by our model and those predicted by VICE is 0.94; (b) Example visual objects and their concept embeddings, with dashed lines representing the 90th percentile of activated concepts; (c) Example visual objects with significant activation on the concepts Home tools and Baked food, respectively.
  • Figure 4: The visual objects generated by controlling the concept embeddings.
  • Figure 5: Measurements of the performance of generated visual objects. (a) The similarity between random visual objects and the target concept embeddings (blue), and the similarity between generated visual objects and the target concept embeddings (orange); (b) The similarity between visual objects and target concept embeddings as the guidance scale changes (blue), and the diversity of visual objects as the guidance scale changes (orange).
  • ...and 6 more figures
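
The concept encoder described in the Figure 2(a) caption can be illustrated with a minimal sketch. The caption only states that a learnable projector maps CLIP image embeddings to concept embeddings and that judgments are predicted from similarities. The rest is assumed here for illustration: a linear projector with a ReLU, dot-product similarity, and a softmax over the three pairs of a triplet odd-one-out trial (the task format of THINGS-similarity). The function names and toy dimensions are hypothetical and may differ from the released code.

```python
import numpy as np


def concept_embedding(clip_embedding, projector_weights):
    """Project a CLIP image embedding into a non-negative concept space.

    Assumption: a single linear map followed by ReLU. The source only
    says the projector is learnable.
    """
    return np.maximum(clip_embedding @ projector_weights, 0.0)


def odd_one_out_probabilities(concepts):
    """Return P(item k is the odd one out) for a triplet.

    `concepts` has shape (3, n_concepts). Assumption: pair similarity is
    the dot product of concept embeddings, and the most similar pair
    leaves the remaining item as the odd one out.
    """
    sims = concepts @ concepts.T
    # Entry k holds the similarity of the pair that excludes item k.
    pair_logits = np.array([sims[1, 2], sims[0, 2], sims[0, 1]])
    pair_logits = pair_logits - pair_logits.max()  # numerical stability
    weights = np.exp(pair_logits)
    return weights / weights.sum()


# Toy example: items 0 and 1 share a concept, item 2 does not.
toy_concepts = np.array([[1.0, 0.0],
                         [1.0, 0.0],
                         [0.0, 1.0]])
probs = odd_one_out_probabilities(toy_concepts)
predicted_odd_one_out = int(np.argmax(probs))  # item 2 in this toy case
```

Under these assumptions, prediction accuracy on a behavioral dataset would be the fraction of trials in which the predicted odd one out matches the human choice.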