Controlling Structured Output Representations from Attributes using Conditional Generative Models

Mohamed Debbagh

TL;DR

This work addresses generating high-dimensional structured outputs from low-dimensional attribute vectors using a CVAE that conditions latent structure on attributes to achieve disentangled, multimodal generation. By learning a conditional prior and posterior, the model enables controlled sampling across attribute configurations, evaluated on CelebA and CUB-200-2011 with a β-weighted ELBO to balance reconstruction and regularization. Results show that higher regularization (larger $\beta$) improves sample plausibility and latent disentanglement, though high-frequency details remain challenging in sparse data; data augmentation is suggested for further gains. Overall, the approach demonstrates the viability of attribute-conditioned CVAEs for targeted, controllable image synthesis with implications for interpretable generative modeling in vision tasks.

Abstract

Structured output representation is a generative task in computer vision that often requires mapping low-dimensional features to high-dimensional structured outputs. Deterministic approaches such as Convolutional Neural Networks (CNNs) lose complex spatial information, which leads to uncertain and ambiguous structures within a single output representation. Sohn et al. present a probabilistic approach through deep Conditional Generative Models (CGMs), in which a particular model, the Conditional Variational Auto-encoder (CVAE), is introduced and explored. While the original paper focuses on image segmentation, this paper adopts the CVAE framework for controlled output representation through attributes. This approach allows us to learn a disentangled multimodal prior distribution, resulting in a more controlled and robust approach to sample generation. In this work we recreate the CVAE architecture and train it on images conditioned on various attributes obtained from two image datasets: the Large-scale CelebFaces Attributes (CelebA) dataset and the Caltech-UCSD Birds (CUB-200-2011) dataset. We attempt to generate new faces with distinct attributes such as hair color and glasses, as well as bird samples of different species with various attributes. We further introduce strategies for improving generalized sample generation by applying a weighted term to the variational lower bound.
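
To make the weighted variational lower bound concrete, the following is a minimal sketch of one common form of the objective, not the paper's exact implementation. It assumes that $\beta$ multiplies the KL term between the conditional posterior $q(z \mid x, a)$ and the conditional prior $p(z \mid a)$, that both are diagonal Gaussians, and that the reconstruction term is a summed squared error. The function names `gaussian_kl` and `beta_weighted_loss`, and the symbols $x$ (image) and $a$ (attribute vector), are illustrative assumptions.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions.

    q = N(mu_q, exp(logvar_q)) plays the role of the posterior q(z | x, a);
    p = N(mu_p, exp(logvar_p)) plays the role of the conditional prior p(z | a).
    """
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)


def beta_weighted_loss(x, x_recon, mu_q, logvar_q, mu_p, logvar_p, beta):
    """Negative beta-weighted lower bound, averaged over the batch.

    Assumption: squared error stands in for the reconstruction likelihood,
    and beta scales only the KL (regularization) term.
    """
    # Flatten each image and sum the squared error per sample.
    recon = ((x - x_recon) ** 2).reshape(len(x), -1).sum(axis=1)
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return (recon + beta * kl).mean()
```

Under these assumptions, a larger $\beta$ penalizes posteriors that drift from the attribute-conditioned prior more heavily, which is the regularization trade-off described in the TL;DR.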

Paper Structure

This paper contains 20 sections, 10 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the (a) VAE and (b) CVAE frameworks. This figure covers both the deterministic and probabilistic components of each model.
  • Figure 2: Overview of our CVAE architecture's training and sampling pipeline
  • Figure 3: Samples generated from models trained with $\beta$ values ranging from 0.25 to 0.9
  • Figure 4: Reconstructed images generated during training from models with $\beta$ values ranging from 0.25 to 0.9. For each $\beta$, the upper row is the ground-truth image and the bottom row is the reconstructed image
  • Figure 5: Samples generated to demonstrate robustness of the CVAE model