Controlled Face Manipulation and Synthesis for Data Augmentation

Joris Kirchner, Amogh Gudi, Marian Bittner, Chirag Raman

TL;DR

A facial manipulation method that operates in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder). It reduces entanglement of semantic features via dependency-aware conditioning, which accounts for AU co-activation, and orthogonal projection, which removes nuisance attribute directions.

Abstract

Deep learning vision models excel with abundant supervision, but many applications face label scarcity and class imbalance. Controllable image editing can augment scarce labeled data, yet edits often introduce artifacts and entangle non-target attributes. We study this in facial expression analysis, targeting Action Unit (AU) manipulation, where annotation is costly and AU co-activation drives entanglement. We present a facial manipulation method that operates in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder). Using lightweight linear models, we reduce entanglement of semantic features via (i) dependency-aware conditioning that accounts for AU co-activation, and (ii) orthogonal projection that removes nuisance attribute directions (e.g., glasses), together with an expression neutralization step that enables absolute AU edits. We use these edits to balance AU occurrence by editing labeled faces and to diversify identities/demographics via controlled synthesis. Augmenting AU detector training with the generated data improves accuracy and yields more disentangled predictions with fewer co-activation shortcuts, outperforming alternative data-efficient training strategies; our learning-curve analysis suggests the improvements are comparable to those that would otherwise require substantially more labeled data. Compared to prior methods, our edits are stronger, produce fewer artifacts, and preserve identity better.
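The orthogonal projection and the additive latent edit described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `remove_nuisance` and `edit_code`, the re-normalization of the projected direction, the latent dimension, and the random stand-in vectors are all assumptions made for illustration, and the nuisance directions are assumed to be linearly independent.

```python
import numpy as np


def remove_nuisance(w, nuisance):
    """Project an edit direction onto the orthogonal complement of the
    subspace spanned by the nuisance directions (hypothetical helper).

    Assumes the rows of `nuisance` are linearly independent.
    """
    w = np.asarray(w, dtype=float)
    N = np.atleast_2d(np.asarray(nuisance, dtype=float))  # shape (k, d)
    # Orthonormal basis of the nuisance subspace via reduced QR.
    Q, _ = np.linalg.qr(N.T)  # Q has shape (d, k)
    w_perp = w - Q @ (Q.T @ w)
    norm = np.linalg.norm(w_perp)
    if norm < 1e-12:
        raise ValueError("edit direction lies inside the nuisance subspace")
    # Re-normalizing is an assumption; the source does not specify it.
    return w_perp / norm


def edit_code(z, w, s):
    """Shift a semantic code z along direction w with strength s."""
    return np.asarray(z, dtype=float) + s * np.asarray(w, dtype=float)


# Random stand-ins for a learned AU direction, two nuisance directions
# (e.g., glasses), and an encoded semantic code.
rng = np.random.default_rng(0)
d = 16
w_au = rng.normal(size=d)
nuisance = rng.normal(size=(2, d))
z = rng.normal(size=d)

w_clean = remove_nuisance(w_au, nuisance)
z_edit = edit_code(z, w_clean, s=1.0)
```

Because `w_clean` is orthogonal to every nuisance direction, the edit leaves the component of `z` along those directions unchanged, which is the property the projection is meant to provide in a purely linear model.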

Paper Structure (18 sections, 1 equation, 14 figures, 1 table)

This paper contains 18 sections, 1 equation, 14 figures, 1 table.

Figures (14)

  • Figure 1: Distribution of AU occurrence labels in datasets. The naturally occurring distribution of AU labels in real data (DISFA, DISFA+) is highly skewed (black). In contrast, our method for controlled editing/synthesis allows for the generation of datasets with a balanced distribution (green).
  • Figure 2: Overview of our method. (Top-Left) Learning linear edit directions on semantic codes, where AU models are conditioned on other AUs. Afterwards, AU directions are projected to be orthogonal to the directions of possible nuisance attributes. (Top-Right) Editing existing images: encode $(x_T,z)$, pick target AU, obtain disentangled direction $w$, set $z\leftarrow z+s\,w$, decode with original $x_T$. (Bottom) Synthesizing new faces: sample $(x_T,z_{\text{sample}})$ from DiffAE, optionally accept by demographic predictors, neutralize $z_{\text{sample}}$ via the neutralization model $\mathcal{N}$ by optimizing the objective in Equation 1, then edit and decode.
  • Figure 3: An illustrative DAG showing an example of AU relations. The intensity of AU 1 is defined by the reconstructed image, but also by AU 2, as the two are often activated together. Conditioning on a correlated AU blocks leakage from that AU to the target AU. At the same time, attributes that are influenced by AU 1, called colliders (e.g., surprised), should not be conditioned upon, to avoid opening spurious paths.
  • Figure 4: Examples of controlled AU editing (top half) and controlled synthesis of new faces (bottom half). Editing examples (top) use existing neutral images from the DISFA dataset, where the reconstruction is obtained by encoding and decoding without editing. Synthesis examples (bottom) contain random identities sampled from DiffAE with arbitrary facial expressions, which are first neutralized by the non-linear procedure to suppress pre-existing expressions. Thereafter, each AU is modified by $+1$ using the proposed method, corresponding to FACS activation level E. Edits are largely free of major artifacts, and neutralization succeeds in deactivating AUs; however, complex expressions may leave small residuals, and editing may slightly alter the background of synthetic images.
  • Figure 5: (Right) Comparison of correlation between detected AUs in the real DISFA data (top/red) and the edited/synthetic data (bottom/blue), as estimated by an AU detection tool (FaceReader) used as a stand-in for human annotators. The color heatmap represents the magnitude of the correlation, while the green/yellow/red markers in the edited/synthetic data table denote the change w.r.t. correlations in the real data (green = reduction, red = increase, yellow = no significant change). Detected AUs in the edited/synthetic data are much less correlated with each other on average (0.09) than the detected AUs in the real data (0.16). (Left) For reference, the inter-AU correlations between the ground truth labels in DISFA are also shown. These exhibit patterns similar to the detector's AU estimates on the real images, suggesting limited influence of detector bias.
  • ...and 9 more figures
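
The average inter-AU correlation reported for Figure 5 can be illustrated with a short sketch. The exact aggregation behind the 0.09 and 0.16 values is not specified here, so the mean of absolute off-diagonal Pearson correlations used below is an assumption, as are the function name `mean_abs_offdiag_corr` and the toy score matrices.

```python
import numpy as np


def mean_abs_offdiag_corr(au_scores):
    """Mean absolute Pearson correlation over all distinct AU pairs.

    `au_scores` has shape (n_images, n_aus); each column holds the
    detector output for one AU and must not be constant.
    """
    au = np.asarray(au_scores, dtype=float)
    corr = np.corrcoef(au, rowvar=False)         # (n_aus, n_aus)
    off_diag = ~np.eye(corr.shape[0], dtype=bool)  # drop self-correlations
    return float(np.abs(corr[off_diag]).mean())


# Toy data: two AUs that always co-activate vs. two uncorrelated AUs.
coupled = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
decoupled = np.array([[0, 1], [1, 0], [0, -1], [-1, 0]])

coupled_score = mean_abs_offdiag_corr(coupled)
decoupled_score = mean_abs_offdiag_corr(decoupled)
```

Under this assumed metric, a lower value for the edited/synthetic data than for the real data would indicate weaker co-activation between detected AUs.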