Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks

Zhiying Li, Yeying Jin, Fan Shen, Zhi Liu, Weibin Chen, Pengju Zhang, Xiaomei Zhang, Boyu Chen, Michael Shen, Kejian Wu, Zhaoxin Fan, Jin Dong

TL;DR

This work reveals critical security vulnerabilities in expressive human pose and shape estimation (EHPS) by introducing Tangible Attack (TBA), a framework that crafts adversarial perturbations to disrupt EHPS across models. TBA uses a Dual Heterogeneous Noise Generator (DHNG) combining a Variational Autoencoder and ControlNet, optimized with a novel adversarial loss and multi-gradient PGD to produce effective, controllable noise. Extensive experiments on 3DPW and UBody show consistent, substantial degradation in pose and shape estimation across state-of-the-art EHPS models, highlighting urgent needs for defenses and robust design. The study underscores the practical risk of adversarial perturbations in digital human generation and motivates future work on imperceptible attacks and certifiable robustness.
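The PGD step mentioned above can be illustrated with a minimal sketch. This is plain single-gradient, L-infinity PGD on a toy linear regressor that stands in for an EHPS model. It is not the paper's multi-gradient variant, its DHNG noise generator, or its adversarial loss. The function names, the toy model `W`, and the values of `eps`, `alpha`, and `steps` are assumptions chosen only for illustration.

```python
import numpy as np


def pgd_attack(x, y, W, eps=0.03, alpha=0.005, steps=20, seed=0):
    """L-infinity PGD that increases the squared error of a toy regressor W @ x.

    Hypothetical stand-in: a real EHPS model would replace W, and its
    gradient would come from automatic differentiation.
    """
    rng = np.random.default_rng(seed)
    # Random start inside the eps-ball (a common PGD variant).
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        residual = W @ x_adv - y                    # current prediction error
        grad = W.T @ residual                       # gradient of 0.5 * ||W x - y||^2 w.r.t. x
        x_adv = x_adv + alpha * np.sign(grad)       # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep a valid pixel range
    return x_adv


def mse(x, y, W):
    """Mean squared error of the toy regressor on input x."""
    return float(np.mean((W @ x - y) ** 2))


rng = np.random.default_rng(42)
x_clean = rng.uniform(0.2, 0.8, size=64)            # toy "image" as a flat vector
W = rng.normal(size=(10, 64))                       # toy model weights
y_true = W @ x_clean + rng.normal(scale=0.01, size=10)

eps = 0.03
x_adv = pgd_attack(x_clean, y_true, W, eps=eps)
clean_err = mse(x_clean, y_true, W)
adv_err = mse(x_adv, y_true, W)
print(f"clean error: {clean_err:.6f}, adversarial error: {adv_err:.6f}")
```

The projection step is what keeps the perturbation bounded: however many iterations run, no element of `x_adv` differs from `x_clean` by more than `eps`.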

Abstract

Expressive human pose and shape estimation (EHPS) is crucial for digital human generation, especially in applications like live streaming. While existing research primarily focuses on reducing estimation errors, it largely neglects robustness and security aspects, leaving these systems vulnerable to adversarial attacks. To address this significant challenge, we propose the Tangible Attack (TBA), a novel framework designed to generate adversarial examples capable of effectively compromising any digital human generation model. Our approach introduces a Dual Heterogeneous Noise Generator (DHNG), which leverages Variational Autoencoders (VAE) and ControlNet to produce diverse, targeted noise tailored to the original image features. Additionally, we design a custom adversarial loss function to optimize the noise, ensuring both high controllability and potent disruption. By iteratively refining the adversarial sample through multi-gradient signals from both the noise and the state-of-the-art EHPS model, TBA substantially improves the effectiveness of adversarial attacks. Extensive experiments demonstrate TBA's superiority, achieving a remarkable 41.0% increase in estimation error, with an average improvement of approximately 17.0%. These findings expose significant security vulnerabilities in current EHPS models and highlight the need for stronger defenses in digital human generation systems.
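To make the reported percentages concrete, here is a small sketch of how a relative increase in estimation error is typically computed. It assumes the abstract's figures are relative increases over the clean-input error, which the text above does not state explicitly. The function name and the numbers are hypothetical and are not taken from the paper's tables.

```python
def relative_increase(clean_error, adv_error):
    """Percentage increase of the adversarial error over the clean error."""
    if clean_error <= 0:
        raise ValueError("clean_error must be positive")
    return 100.0 * (adv_error - clean_error) / clean_error


# Hypothetical example values (for instance, an error measured in millimetres).
clean_error_mm = 80.0
adv_error_mm = 100.0
increase_pct = relative_increase(clean_error_mm, adv_error_mm)
print(f"{increase_pct:.1f}% increase in estimation error")  # prints "25.0% increase in estimation error"
```

Under this reading, a model whose error rises from 80 to 100 units has a 25% increase. An average figure would then be the mean of such percentages across models, datasets, or metrics.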

Paper Structure

This paper contains 26 sections, 12 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Under normal conditions, a clean image generates a realistic digital human with the correct posture. During an attack, adversarial samples cause significant deviations in the model’s output, resulting in a risky posture.
  • Figure 2: The overview of the TBA framework is as follows: 1) Dual Heterogeneous Noise Generators: Utilizing a combination of VAE and ControlNet, this component generates diverse and targeted noise tailored to the characteristics of the original image. 2) Noise Optimization: The generated noise is refined through the application of a novel adversarial loss function and the gradients of the SMPLer-X model. 3) Enhancement: The attack efficacy of the noise is further amplified through iterative optimization using the PGD attack method.
  • Figure 3: Visualizing Clean vs. Adversarial (Adv) samples of 3DPW for digital human generation under state-of-the-art EHPS models.
  • Figure 4: Visualizing Clean vs. Adversarial (Adv) samples of UBody for digital human generation under state-of-the-art EHPS models.
  • Figure 5: Comparison of clean and adversarial samples on 3DPW: the gap between each pose parameter inferred by SMPLer-X models and the ground truth parameters.
  • ...and 2 more figures